Exploratory Data Analysis
\(\hspace{0.3cm}\) More articles: \(\hspace{0.1cm}\) Estadistica4all
\(\hspace{0.3cm}\) Author: \(\hspace{0.1cm}\) Fabio Scielzo Ortiz
\(\hspace{0.3cm}\) If you use this article, please cite it as:
\(\hspace{0.5cm}\) Scielzo Ortiz, Fabio. (2023). Exploratory Data Analysis. http://estadistica4all.com/Articulos/EDA.html
It’s recommended to open the article on a computer or tablet.
1 Exploratory Data Analysis (EDA)
Exploratory data analysis (EDA) refers to the descriptive statistical analysis of a data-set.
Next, we propose a methodology for carrying out an EDA, using Python as the programming language.
2 Data Pre-processing
2.1 Import data-set
First of all, we import the data-set with which we will work.
```python
import pandas as pd

Netflix_Data = pd.read_csv('titles.csv')
Netflix_Data
```

| | id | title | type | description | release_year | age_certification | runtime | genres | production_countries | seasons | imdb_id | imdb_score | imdb_votes | tmdb_popularity | tmdb_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ts300399 | Five Came Back: The Reference Films | SHOW | This collection includes 12 World War II-era p… | 1945.0 | TV-MA | 51 | ['documentation'] | ['US'] | 1.0 | NaN | NaN | NaN | 0.600 | NaN |
| 1 | tm84618 | Taxi Driver | MOVIE | A mentally unstable Vietnam War veteran works … | 1976.0 | R | 114 | ['drama', 'crime'] | ['US'] | NaN | tt0075314 | 8.2 | 808582.0 | 40.965 | 8.179 |
| 2 | tm154986 | Deliverance | MOVIE | Intent on seeing the Cahulawassee River before… | 1972.0 | R | 109 | ['drama', 'action', 'thriller', 'european'] | ['US'] | NaN | tt0068473 | 7.7 | 107673.0 | 10.010 | 7.300 |
| 3 | tm127384 | Monty Python and the Holy Grail | MOVIE | King Arthur, accompanied by his squire, recrui… | 1975.0 | PG | 91 | ['fantasy', 'action', 'comedy'] | ['GB'] | NaN | tt0071853 | 8.2 | 534486.0 | 15.461 | 7.811 |
| 4 | tm120801 | The Dirty Dozen | MOVIE | 12 American military prisoners in World War II… | 1967.0 | NaN | 150 | ['war', 'action'] | ['GB', 'US'] | NaN | tt0061578 | 7.7 | 72662.0 | 20.398 | 7.600 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 5845 | tm1014599 | Fine Wine | MOVIE | A beautiful love story that can happen between… | 2021.0 | NaN | 100 | ['romance', 'drama'] | ['NG'] | NaN | tt13857480 | 6.8 | 45.0 | 1.466 | NaN |
| 5846 | tm898842 | C/O Kaadhal | MOVIE | A heart warming film that explores the concept… | 2021.0 | NaN | 134 | ['drama'] | [] | NaN | tt11803618 | 7.7 | 348.0 | NaN | NaN |
| 5847 | tm1059008 | Lokillo | MOVIE | A controversial TV host and comedian who has b… | 2021.0 | NaN | 90 | ['comedy'] | ['CO'] | NaN | tt14585902 | 3.8 | 68.0 | 26.005 | 6.300 |
| 5848 | tm1035612 | Dad Stop Embarrassing Me - The Afterparty | MOVIE | Jamie Foxx, David Alan Grier and more from the… | 2021.0 | PG-13 | 37 | [] | ['US'] | NaN | NaN | NaN | NaN | 1.296 | 10.000 |
| 5849 | ts271048 | Mighty Little Bheem: Kite Festival | SHOW | With winter behind them, Bheem and his townspe… | 2021.0 | NaN | 7 | ['family', 'animation', 'comedy'] | [] | 1.0 | tt13711094 | 7.8 | 18.0 | 2.289 | 10.000 |
5850 rows × 15 columns
2.2 Data-set conceptual description
This data-set contains information on 15 variables for 5850 Netflix titles.
The following table gives a brief conceptual description of each variable:
| Variable | Description | Type |
|---|---|---|
| id | The title ID on JustWatch | Identifier |
| title | The name of the title | Text |
| type | TV show or movie | Categorical |
| description | A brief description | Text |
| release_year | Release year | Quantitative |
| age_certification | Age rating | Categorical |
| runtime | Number of episodes (show) or duration in minutes (movie) | Quantitative |
| genres | A list of genres | Categorical |
| production_countries | A list of countries that produced the title | Categorical |
| seasons | Number of seasons if it’s a SHOW | Quantitative |
| imdb_id | The title ID on IMDB | Identifier |
| imdb_score | Rating on IMDB | Quantitative |
| imdb_votes | number of votes on IMDB | Quantitative |
| tmdb_popularity | Popularity on TMDB | Quantitative |
| tmdb_score | Rating on TMDB | Quantitative |
2.3 Data-set size
The data-set size is given by its number of rows and columns:

```python
Netflix_Data.shape
```
(5850, 15)
As discussed above, the data-set has 5850 rows and 15 columns.
2.4 info() method
The info() method gives us the column names, the number of non-null values in each column, and each column's dtype.

```python
Netflix_Data.info()
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5850 entries, 0 to 5849
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 5850 non-null object
1 title 5849 non-null object
2 type 5850 non-null object
3 description 5832 non-null object
4 release_year 5850 non-null int64
5 age_certification 3231 non-null object
6 runtime 5850 non-null int64
7 genres 5850 non-null object
8 production_countries 5850 non-null object
9 seasons 2106 non-null float64
10 imdb_id 5447 non-null object
11 imdb_score 5368 non-null float64
12 imdb_votes 5352 non-null float64
13 tmdb_popularity 5759 non-null float64
14 tmdb_score 5539 non-null float64
dtypes: float64(5), int64(2), object(8)
memory usage: 685.7+ KB
2.5 Column types
There is another way to get column types.
```python
Netflix_Data.dtypes
```
id object
title object
type object
description object
release_year int64
age_certification object
runtime int64
genres object
production_countries object
seasons float64
imdb_id object
imdb_score float64
imdb_votes float64
tmdb_popularity float64
tmdb_score float64
dtype: object
object is the typical dtype of categorical, identifier, or text variables.
float64 and int64 are the typical dtypes of quantitative variables: float64 for continuous ones and int64 for discrete ones.
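As an illustration, we can use these dtypes to select only the quantitative columns of a data-frame. This is a generic sketch on a toy data-frame, not the Netflix data:

```python
import pandas as pd

# Toy data-frame with one int64, one float64 and one object column
df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5], 'c': ['x', 'y']})

# Keep only the quantitative (numeric) columns
numeric_cols = list(df.select_dtypes(include=['int64', 'float64']).columns)
print(numeric_cols)  # ['a', 'b']
```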
2.6 Change column types
We can change the type of a column with the astype() method.
First, we check that the type of the release_year variable is int64.

```python
Netflix_Data.dtypes['release_year']
```
dtype('int64')
Now, we can change that type to float64 using the following function:
```python
def change_type(Data, Variable_name, New_type):
    Data[Variable_name] = Data[Variable_name].astype(New_type)

change_type(Data=Netflix_Data, Variable_name='release_year', New_type='float64')
```

We can check that the change has been applied correctly:

```python
Netflix_Data.dtypes['release_year']
```
dtype('float64')
2.7 Unique values of a variable
We can get the unique values of a variable with the unique() method.
We can get the unique values of type as follows:

```python
Netflix_Data['type'].unique()
```
array(['SHOW', 'MOVIE'], dtype=object)
We can get the unique values of age_certification as follows:

```python
Netflix_Data['age_certification'].unique()
```
array(['TV-MA', 'R', 'PG', nan, 'TV-14', 'PG-13', 'TV-PG', 'TV-Y', 'TV-G', 'TV-Y7', 'G', 'NC-17'], dtype=object)
We can get the unique values of production_countries as follows:

```python
Netflix_Data['production_countries'].unique()
```
array(["['US']", "['GB']", "['GB', 'US']", "['EG']", "['DE']", "['IN']",
"['SU', 'IN']", "['LB', 'CA', 'FR']", '[]', "['LB']",
"['DZ', 'EG']", "['CA', 'FR', 'LB']", "['US', 'GB']",
"['US', 'IT']", "['JP']", "['AR']", "['FR', 'EG']", "['FR', 'LB']",
"['CA', 'US']", "['US', 'FR']", "['JP', 'US']", "['US', 'CA']",
"['DE', 'US']", "['PE', 'US', 'BR']", "['IT', 'US', 'FR']",
"['IE', 'GB', 'DE', 'FR']", "['HK', 'US']", "['AU']", "['FR']",
"['DE', 'GH', 'GB', 'US', 'BF']", "['MX']", "['ES', 'AR']",
"['CO']", "['PS', 'US', 'FR', 'DE']", "['FR', 'NO', 'LB', 'BE']",
"['BE', 'FR', 'IT', 'LB']", "['TR']", "['IN', 'SU']", "['DK']",
"['CA']", "['DE', 'GB', 'US', 'BS', 'CZ']", "['MT', 'GB', 'US']",
"['AU', 'DE', 'GB', 'US']", "['US', 'JP']", "['BE', 'US']",
"['HK']", "['IT']", "['US', 'FR', 'DE', 'GB']",
"['GB', 'US', 'FR', 'DE']", "['IT', 'US']", "['US', 'ZA']",
"['GB', 'ES']", "['GB', 'US', 'JP']", "['HK', 'CN']",
"['GB', 'US', 'BG']", "['RU']", "['KR']", "['CA', 'US', 'IN']",
"['CN']", "['JP', 'HK']", "['CA', 'GB', 'US']",
"['FR', 'MX', 'ES']", "['IN', 'US']", "['AR', 'ES']", "['CL']",
"['FR', 'MA', 'DE', 'PS']", "['AR', 'DE', 'UY', 'ES']",
"['CL', 'AR']", "['CZ', 'GB', 'DK', 'NL', 'SE']", "['TW']",
"['SG']", "['NG']", "['MY']", "['Lebanon']",
"['BE', 'FR', 'ES', 'CH', 'PS']", "['ZA']", "['NG', 'US']",
"['LB', 'FR']", "['CN', 'HK']", "['PH']", "['LB', 'GB', 'FR']",
"['FR', 'DE', 'KW', 'PS']", "['PS']",
"['GB', 'US', 'AT', 'FR', 'DE', 'NG']", "['XX']", "['AE', 'US']",
"['DK', 'US']", "['FR', 'US', 'GB']", "['HU', 'US', 'CA']",
"['NO']", "['GB', 'FR', 'DE']", "['US', 'HU', 'IT']",
"['US', 'ZA', 'DE']", "['IN', 'DE']", "['SA']", "['ID']",
"['US', 'LB', 'AE']", "['PS', 'NL', 'US', 'AE']",
"['US', 'FR', 'GB']", "['US', 'DE', 'GB']", "['GB', 'ZA']",
"['US', 'CA', 'CL']", "['US', 'GB', 'CN', 'CA']",
"['AU', 'CH', 'GB']", "['ES']", "['FI']", "['IL']", "['FR', 'US']",
"['AU', 'US']", "['CA', 'US', 'GB']", "['AT']", "['CD', 'GB']",
"['US', 'BR']", "['CA', 'JP', 'US']", "['CA', 'KR']",
"['US', 'EG', 'GB']", "['BR']", "['PL']", "['VE', 'AR']", "['RO']",
"['IL', 'NO', 'ZA', 'AE', 'GB', 'IS', 'IE']",
"['US', 'CN', 'DE', 'SG', 'UA']", "['DE', 'IT', 'PS', 'FR']",
"['AE', 'LB']", "['LB', 'AE']", "['US', 'ES']", "['NZ']",
"['GB', 'US', 'FR']", "['US', 'FR', 'LU', 'GB']", "['FR', 'BE']",
"['IT', 'GB']", "['US', 'CA', 'GB']", "['CA', 'FR']",
"['US', 'CN']", "['UA']", "['MX', 'ZA', 'US']",
"['US', 'GB', 'ES']", "['BE', 'DK', 'DE', 'GB', 'US']",
"['GB', 'IR', 'JO', 'QA']", "['CH', 'US']", "['CA', 'DE', 'GB']",
"['GH', 'US']", "['IE', 'GB']", "['CN', 'US']",
"['UA', 'GB', 'US']", "['IE', 'ZA']", "['US', 'FR', 'MT']",
"['BG']", "['GB', 'FR']", "['BY']", "['IE']", "['IS']",
"['AU', 'FR', 'DE']", "['CN', 'FR', 'CA']", "['FR', 'QA']",
"['SE']", "['FR', 'ES']", "['NL']", "['HR']", "['FR', 'MA']",
"['RU', 'US', 'FR']", "['SY', 'GB']", "['AT', 'US']", "['CD']",
"['FR', 'CL']", "['AU', 'GB']", "['TN']", "['AE']", "['SE', 'NO']",
"['GL', 'FR']", "['LB', 'DE']", "['PT', 'SE', 'DK', 'BR', 'FR']",
"['QA', 'LB']", "['GB', 'AU', 'US']", "['ES', 'DK']",
"['AE', 'FR', 'JO', 'LB', 'QA', 'PS']", "['US', 'CA', 'JP']",
"['PK']", "['IN', 'GB']", "['PS', 'FR', 'DE']", "['CZ']",
"['CA', 'NG']", "['VN']", "['NL', 'GB']",
"['CA', 'HU', 'MX', 'ES', 'GB', 'US']", "['FR', 'GB', 'US']",
"['FR', 'NL', 'GB', 'US']", "['CN', 'CA', 'US']", "['CA', 'GB']",
"['KR', 'US']", "['FR', 'RO', 'GB', 'BE', 'DE']", "['US', 'MX']",
"['HK', 'IS', 'US']", "['IN', 'CN', 'US', 'GB']", "['BE', 'FR']",
"['PR', 'US', 'GB', 'CN']", "['GB', 'DE']", "['US', 'PR']",
"['IT', 'CH', 'FR']", "['IT', 'ES', 'FR']", "['US', 'IS', 'NO']",
"['IQ', 'GB']", "['HU']", "['US', 'AU', 'GB']",
"['CZ', 'GB', 'US']", "['US', 'IE', 'CA']", "['TH']",
"['IR', 'US', 'FR']", "['BE']",
"['GB', 'ID', 'CA', 'CN', 'SG', 'US']", "['ES', 'FR']",
"['SG', 'GB', 'US']", "['GE', 'DE', 'FR']", "['CA', 'US', 'DE']",
"['CA', 'IE']", "['NL', 'BE']", "['US', 'KH']", "['FR', 'JP']",
"['PR']", "['US', 'CA', 'CN']", "['CN', 'US', 'ES']",
"['CU', 'US']", "['BG', 'US']", "['US', 'BG']",
"['US', 'DK', 'GB']", "['ES', 'IT']", "['TR', 'US']",
"['PE', 'DE', 'NO']", "['LU', 'US', 'FR']",
"['IL', 'MA', 'US', 'BG', 'GB']", "['AR', 'CL']",
"['AR', 'ES', 'UY']", "['JP', 'CN']", "['US', 'AU']",
"['QA', 'TN', 'FR']", "['ES', 'MX']", "['PH', 'SG']",
"['US', 'AE']", "['DE', 'DK', 'NL', 'GB']", "['NL', 'MX']",
"['CA', 'CN']", "['NO', 'SE', 'DK', 'NL']", "['US', 'DE', 'ZA']",
"['IS', 'SE', 'BE']", "['DE', 'ES']", "['CN', 'FR', 'TW', 'US']",
"['KH']", "['BE', 'FR', 'IT']", "['DE', 'CH']",
"['JP', 'KR', 'FR']", "['DE', 'NZ', 'GB']", "['PE']",
"['MX', 'US']", "['US', 'DK']", "['PL', 'US']", "['KE']", "['GH']",
"['IT', 'CH', 'VA', 'FR', 'DE']", "['PE', 'GB', 'US', 'IL', 'IT']",
"['SA', 'SY', 'AE']", "['US', 'KR']", "['IN', 'FR']",
"['RS', 'PL', 'RU']", "['CL', 'NL', 'FR']", "['IE', 'CA']",
"['US', 'NL']", "['TZ']", "['IT', 'ES']", "['ID', 'MY', 'SG']",
"['FR', 'LU', 'CA']", "['FR', 'QA', 'TN', 'BE']",
"['PL', 'CH', 'AL', 'IT']", "['CZ', 'US']", "['AR', 'FR']",
"['DE', 'IT']", "['IT', 'FR']", "['MX', 'FI']", "['CA', 'BR']",
"['IN', 'MX']", "['BR', 'DK', 'FR', 'DE', 'PL', 'AR']",
"['ZA', 'US', 'CA']", "['ES', 'BE']", "['PY']", "['US', 'NG']",
"['US', 'BE', 'GB']", "['ZW']", "['IT', 'AR']",
"['AT', 'IQ', 'US']", "['GE']", "['AR', 'IT']", "['NG', 'NO']",
"['IS', 'GB']", "['MX', 'CO']", "['AR', 'US']", "['KW']",
"['JP', 'GB']", "['TW', 'US']", "['NP', 'IN']",
"['AU', 'US', 'CN']", "['FR', 'IN', 'SG']", "['LB', 'PS']",
"['JP', 'US', 'CA']", "['CM']", "['BD', 'IN']", "['CA', 'ZA']",
"['FR', 'PS', 'CH', 'QA']", "['NL', 'JO', 'DE']",
"['GB', 'DK', 'GR']", "['MX', 'AR']", "['US', 'CL', 'MX']",
"['KG']", "['CH']", "['BD']", "['LU']", "['ZA', 'GB']",
"['BT', 'CN']", "['CA', 'HU', 'US']", "['BE', 'LT', 'NL']",
"['IT', 'MC', 'US', 'CA']", "['CN', 'US', 'AU', 'CA']",
"['BE', 'SE', 'GB']", "['GB', 'CZ', 'FR']", "['US', 'MW', 'GB']",
"['US', 'CY']", "['BE', 'FR', 'SN']", "['BR', 'FR', 'ES', 'BE']",
"['US', 'CH']", "['US', 'IL']", "['FR', 'LT', 'GB']",
"['GB', 'IE']", "['GB', 'IT']", "['JO', 'TH', 'US', 'AL']",
"['PT', 'US']", "['IL', 'US', 'FR', 'DE']", "['TW', 'MY']",
"['US', 'CA', 'FR', 'ES']", "['FI', 'NO']", "['US', 'FR', 'JP']",
"['GB', 'JP']", "['US', 'CN', 'GB']",
"['US', 'FR', 'SE', 'GB', 'DE', 'DK', 'CA']", "['DE', 'AT']",
"['US', 'TH']", "['PH', 'US']", "['BR', 'MX']", "['NO', 'CA']",
"['CO', 'ES']", "['CN', 'DE', 'GB']", "['NO', 'DE']",
"['ES', 'PT']", "['IL', 'US']", "['ES', 'BE', 'DE']",
"['TH', 'US']", "['US', 'FR', 'ES']", "['ES', 'FR', 'AR']",
"['NL', 'PL', 'UA', 'GB', 'US']", "['QA', 'PS']",
"['RS', 'UY', 'AR']", "['FR', 'IT']", "['CA', 'LK']",
"['US', 'AR']", "['EG', 'US']", "['US', 'IN']",
"['FR', 'LU', 'BE', 'KH']", "['US', 'BE', 'ES']",
"['CA', 'FR', 'JP', 'GB', 'US']", "['AT', 'DE']",
"['US', 'GB', 'DE']", "['FR', 'MX', 'CO']", "['BR', 'FR']",
"['JO']", "['FR', 'IN', 'QA']", "['AR', 'PE']", "['MU']",
"['DE', 'DK', 'EG']", "['US', 'IE']", "['IO']", "['TW', 'CN']",
"['FR', 'NL', 'SG']", "['SN']", "['UY']", "['DE', 'IN', 'AT']",
"['MA', 'FR', 'QA']", "['PS', 'PH']", "['EG', 'SA']",
"['ES', 'CN']", "['CL', 'AR', 'CA']", "['AR', 'CO']",
"['GT', 'UY']", "['AF', 'DE', 'PS']", "['ZA', 'AO']",
"['HK', 'PH']", "['SG', 'MY']", "['SE', 'US']",
"['LB', 'US', 'NL', 'CA']", "['NL', 'PS', 'US', 'LB']",
"['DK', 'LB', 'GB']", "['UY', 'MX', 'ES']", "['PH', 'JP']",
"['CN', 'JP', 'US']", "['NA']", "['LB', 'QA', 'SY', 'FR']",
"['PS', 'DK', 'LB']", "['US', 'CZ']",
"['GB', 'AU', 'CA', 'GR', 'NZ']", "['GR', 'GB', 'US']",
"['DE', 'FR']", "['NL', 'US']", "['AT', 'GB', 'US']",
"['CH', 'DE']", "['GB', 'US', 'DE']", "['DK', 'IS']",
"['FR', 'DE', 'US']", "['US', 'JP', 'TH']", "['FR', 'DE']",
"['RO', 'US']", "['ES', 'KN']", "['SE', 'GB']",
"['SG', 'US', 'IN']", "['DE', 'AU']", "['GB', 'CA']",
"['IE', 'US', 'CA']", "['PT']", "['US', 'PL', 'KR']",
"['LU', 'FR']", "['IT', 'BR']", "['GB', 'HU', 'NL', 'CH']",
"['BR', 'DE', 'QA', 'MX', 'US', 'CH', 'AR']", "['ES', 'PE']",
"['BE', 'GB', 'DE']", "['ZA', 'GB', 'US']", "['CL', 'PE']",
"['CA', 'CN', 'US']", "['SG', 'US']", "['BR', 'US']",
"['BE', 'NL']", "['RU', 'US']", "['ES', 'US']", "['CZ', 'DE']",
"['NZ', 'HK']", "['MA', 'SA', 'TN', 'EG', 'LB']", "['CN', 'GB']",
"['AF']", "['BE', 'LU']", "['BE', 'DE']", "['SE', 'RO']",
"['ZA', 'US']", "['GB', 'IN']", "['HU', 'CA']", "['NG', 'CA']",
"['TZ', 'GB']", "['PH', 'FO']"], dtype=object)
2.8 NaN identification
NaN stands for "not a number". In practice, a NaN is equivalent to a missing value.
We are going to calculate, for each variable, the proportion of missing values over the total number of observations. We can do it using the isnull() method:
```python
def Prop_NaN(Data):
    df_prop_nan = Data.isnull().sum() / len(Data)
    return df_prop_nan

Prop_NaN(Data=Netflix_Data)
```
id 0.000000
title 0.000171
type 0.000000
description 0.003077
release_year 0.000000
age_certification 0.447692
runtime 0.000000
genres 0.000000
production_countries 0.000000
seasons 0.640000
imdb_id 0.068889
imdb_score 0.082393
imdb_votes 0.085128
tmdb_popularity 0.015556
tmdb_score 0.053162
dtype: float64
We can see that there are variables with a high proportion of missing values, such as age_certification (44.77%).
seasons is the variable with the highest proportion of missing values (64%), but this is simply because seasons is only defined for titles with type=SHOW.
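We can verify this kind of claim by computing the proportion of missing seasons values separately for each title type. Below is a minimal sketch on a toy data-frame with the same column names as the Netflix data-set (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy data: movies have no 'seasons' value, shows do
toy = pd.DataFrame({
    'type':    ['MOVIE', 'MOVIE', 'SHOW', 'SHOW'],
    'seasons': [np.nan,  np.nan,  1.0,    3.0],
})

# Proportion of missing 'seasons' per title type
prop = toy.groupby('type')['seasons'].apply(lambda s: s.isnull().mean())
print(prop['MOVIE'], prop['SHOW'])  # 1.0 0.0
```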
2.9 Variable Scaling
Scaling a variable means applying a transformation to it in order to obtain new properties for the transformed variable, properties that the original variable does not have.
In this article, we will focus on three scaling methods: standard scaling, normalization (0,1), and normalization (a,b).
In any case, there are more procedures that will not be explored here, so for a more extensive list, it is recommended to consult the sklearn documentation on this topic: https://scikit-learn.org/stable/modules/preprocessing.html
Some of the concepts that appear in this section, such as statistical variable, sample, mean, and variance, will be explained in more detail in the Statistical Description section.
2.9.1 Standard Scaling
Given a quantitative statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
\(\hspace{0.25cm}\) The standard scaling version of \(\hspace{0.07cm} X_k\hspace{0.07cm}\) is defined as: \(\\[0.25cm]\)
\[X_k^{std} \hspace{0.1cm} =\hspace{0.1cm} \dfrac{X_k - \overline{X}_k}{\sigma(X_k)} \\\]
Properties:
\(\hspace{0.1cm} \overline{\hspace{0.01cm}X\hspace{0.07cm}}_k^{\hspace{0.07cm}std} \hspace{0.1cm} =\hspace{0.1cm} 0 \\[0.8cm]\)
\(\hspace{0.2cm} \sigma\left(\hspace{0.07cm} X_k^{\hspace{0.07cm}std} \hspace{0.07cm}\right)^2 \hspace{0.1cm} =\hspace{0.1cm} 1 \\\)
Proof :
\(\overline{X}_k ^{\hspace{0.07cm}std} \hspace{0.1cm} =\hspace{0.1cm} \overline{ \left( \dfrac{X_k - \overline{X}_k}{\sigma(X_k)} \right) } \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{\sigma(X_k)} \cdot \left( \hspace{0.12cm} \overline{ \hspace{0.08cm} X_k - \overline{X}_k \hspace{0.08cm} } \hspace{0.12cm} \right) \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{\sigma(X_k)} \cdot \left( \hspace{0.08cm} \overline{X}_k - \overline{X}_k \hspace{0.08cm} \right) \hspace{0.1cm} =\hspace{0.1cm} \dfrac{1}{\sigma(X_k)} \hspace{0.07cm}\cdot \hspace{0.07cm} 0 \hspace{0.1cm}=\hspace{0.1cm} 0 \\[0.8cm]\) \(\\[0.6cm]\)
\(\sigma\left( X_k^{\hspace{0.07cm}std} \right)^2 \hspace{0.1cm} =\hspace{0.1cm} \sigma\left( \dfrac{X_k - \overline{X}_k }{\sigma(X_k)} \right)^2 \hspace{0.1cm} =\hspace{0.1cm} \dfrac{1}{\sigma(X_k)^2} \cdot \sigma\left( \hspace{0.08cm} X_k - \overline{X}_k \hspace{0.08cm} \right)^2 \hspace{0.1cm} =\hspace{0.1cm} \dfrac{1}{\sigma(X_k)^2} \cdot \sigma( \hspace{0.08cm} X_k \hspace{0.08cm} )^2 \hspace{0.1cm}=\hspace{0.1cm} 1\)
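A quick numerical check of these two properties on an arbitrary toy sample, using the population standard deviation (ddof=0), as in the definition:

```python
import numpy as np

X = np.array([2.0, 5.0, 7.0, 10.0, 16.0])   # an arbitrary sample
X_std = (X - X.mean()) / X.std()            # standard scaling (population std)

# Mean 0 and variance 1, up to floating-point error
assert abs(X_std.mean()) < 1e-12
assert abs(X_std.var() - 1.0) < 1e-12
```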
2.9.1.1 Standard Scaling in Python
```python
from sklearn import preprocessing

def standard_scaling(Data, Variable_name):
    scaler = preprocessing.StandardScaler().fit(Data[[Variable_name]])
    Data[Variable_name + '_std_scaling'] = scaler.transform(Data[[Variable_name]])

standard_scaling(Data=Netflix_Data, Variable_name='release_year')
Netflix_Data[['release_year', 'release_year_std_scaling']]
```

| | release_year | release_year_std_scaling |
|---|---|---|
| 0 | 1945.0 | -10.294901 |
| 1 | 1976.0 | -5.826196 |
| 2 | 1972.0 | -6.402803 |
| 3 | 1975.0 | -5.970348 |
| 4 | 1967.0 | -7.123562 |
| … | … | … |
| 5845 | 2021.0 | 0.660634 |
| 5846 | 2021.0 | 0.660634 |
| 5847 | 2021.0 | 0.660634 |
| 5848 | 2021.0 | 0.660634 |
| 5849 | 2021.0 | 0.660634 |
5850 rows × 2 columns
```python
Netflix_Data['release_year_std_scaling'].mean()
```
-1.010549668636587e-14

```python
Netflix_Data['release_year_std_scaling'].std()
```
1.0000854810447344

The mean is zero up to floating-point error. The standard deviation differs slightly from 1 because pandas' std() uses the sample denominator (n-1), while StandardScaler scales with the population denominator (n).
2.9.2 Standardization (0,1)
Given a quantitative statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
\(\hspace{0.25cm}\) The standardization \(\hspace{0.05cm}(0,1)\hspace{0.05cm}\) version of \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is defined as: \(\\[0.5cm]\)
\[X_k^{std(0,1)} \hspace{0.07cm}=\hspace{0.07cm} \dfrac{\hspace{0.07cm} X_k - Min(X_k) \hspace{0.07cm}}{Max(X_k) - Min(X_k)} \\\]
Properties :
\(\hspace{0.1cm} Max \left(X_k^{std(0,1)} \right) \hspace{0.1cm}=\hspace{0.1cm} 1 \\[0.8cm]\)
\(\hspace{0.1cm} Min \left( X_k^{std(0,1)} \right) \hspace{0.1cm}=\hspace{0.1cm} 0 \\\)
Proof :
\(\hspace{0.1cm} Max \left( X_k^{std(0,1)} \right) \hspace{0.1cm} = \hspace{0.1cm} \dfrac{\hspace{0.07cm} Max(X_k) - Min(X_k) \hspace{0.07cm}}{Max(X_k) - Min(X_k)} \hspace{0.1cm}=\hspace{0.1cm} 1 \\[0.8cm]\)
\(\hspace{0.1cm} Min \left( X_k^{std(0,1)} \right) \hspace{0.1cm}=\hspace{0.1cm} \dfrac{\hspace{0.07cm} Min(X_k) - Min(X_k) \hspace{0.07cm}}{Max(X_k) - Min(X_k)} \hspace{0.1cm}=\hspace{0.1cm} 0\)
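A quick numerical check of the definition and its two properties, on an arbitrary toy sample:

```python
import numpy as np

X = np.array([3.0, 4.0, 8.0, 13.0])          # an arbitrary sample
X_01 = (X - X.min()) / (X.max() - X.min())   # standardization (0,1)

assert X_01.min() == 0.0
assert X_01.max() == 1.0
```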
2.9.2.1 Standardization (0,1) in Python
```python
def Normalization(Data, Variable_name, min, max):
    min_max_scaler = preprocessing.MinMaxScaler(feature_range=(min, max))
    min = str(min)
    max = str(max)
    Data[Variable_name + '_norm_' + min + '_' + max] = min_max_scaler.fit_transform(Data[[Variable_name]])

Normalization(Data=Netflix_Data, Variable_name='release_year', min=0, max=1)
Netflix_Data[['release_year', 'release_year_norm_0_1']]
```

| | release_year | release_year_norm_0_1 |
|---|---|---|
| 0 | 1945.0 | 0.000000 |
| 1 | 1976.0 | 0.402597 |
| 2 | 1972.0 | 0.350649 |
| 3 | 1975.0 | 0.389610 |
| 4 | 1967.0 | 0.285714 |
| … | … | … |
| 5845 | 2021.0 | 0.987013 |
| 5846 | 2021.0 | 0.987013 |
| 5847 | 2021.0 | 0.987013 |
| 5848 | 2021.0 | 0.987013 |
| 5849 | 2021.0 | 0.987013 |
5850 rows × 2 columns
```python
Netflix_Data['release_year_norm_0_1'].min()
```
0.0

```python
Netflix_Data['release_year_norm_0_1'].max()
```
1.0
2.9.3 Standardization (a,b)
Given a quantitative statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
\(\hspace{0.25cm}\) The standardization \(\hspace{0.05cm}(a,b)\hspace{0.05cm}\) version of \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is defined as: \(\\[0.5cm]\)
\[X_k^{std(a,b)} \hspace{0.07cm}=\hspace{0.07cm} X_k^{std(0,1)} \cdot (b - a) + a \\\]
\(\hspace{0.25cm}\) where: \(a, b \in \mathbb{R} \\\)
Properties :
\(\hspace{0.1cm} Max \left(X_k^{std(a,b)} \right) = b \\[0.8cm]\)
\(\hspace{0.1cm} Min \left( X_k^{std(a,b)} \right) = a \\\)
Proof :
\(\hspace{0.1cm} Max \left(X_k^{std(a,b)} \right) \hspace{0.07cm}=\hspace{0.07cm} Max \left(X_k^{std(0,1)} \right)\cdot (b-a)+a \hspace{0.07cm}=\hspace{0.07cm} 1\cdot (b-a)+a \hspace{0.07cm}=\hspace{0.07cm} b \\[0.8cm]\)
\(\hspace{0.1cm} Min \left(X_k^{std(a,b)} \right) \hspace{0.07cm}=\hspace{0.07cm} Min \left(X_k^{std(0,1)} \right)\cdot (b-a)+a \hspace{0.07cm}=\hspace{0.07cm} 0\cdot (b-a)+a \hspace{0.07cm}=\hspace{0.07cm} a\)
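Likewise, we can check numerically that rescaling the (0,1) version to (a,b) attains the claimed minimum and maximum:

```python
import numpy as np

a, b = 2.0, 7.0
X = np.array([3.0, 4.0, 8.0, 13.0])
X_01 = (X - X.min()) / (X.max() - X.min())   # standardization (0,1)
X_ab = X_01 * (b - a) + a                    # standardization (a,b)

assert X_ab.min() == a
assert X_ab.max() == b
```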
2.9.3.1 Standardization (a,b) in Python
```python
Normalization(Data=Netflix_Data, Variable_name='release_year', min=2, max=7)
Netflix_Data[['release_year', 'release_year_norm_2_7']]
```

| | release_year | release_year_norm_2_7 |
|---|---|---|
| 0 | 1945.0 | 2.000000 |
| 1 | 1976.0 | 4.012987 |
| 2 | 1972.0 | 3.753247 |
| 3 | 1975.0 | 3.948052 |
| 4 | 1967.0 | 3.428571 |
| … | … | … |
| 5845 | 2021.0 | 6.935065 |
| 5846 | 2021.0 | 6.935065 |
| 5847 | 2021.0 | 6.935065 |
| 5848 | 2021.0 | 6.935065 |
| 5849 | 2021.0 | 6.935065 |
5850 rows × 2 columns
```python
Netflix_Data['release_year_norm_2_7'].min()
```
2.0

```python
Netflix_Data['release_year_norm_2_7'].max()
```
6.999999999999986

The maximum equals 7 up to floating-point rounding error.
2.10 Standard Recoding of Categorical Variables
Given a categorical statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) with range \(\hspace{0.07cm}R( X_k) = \lbrace g_1, g_2 , ..., g_h \rbrace\hspace{0.07cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
\(\hspace{0.25cm}\) The standard recode of \(\hspace{0.07cm}X_k\hspace{0.07cm}\) consists of obtaining a new sample \(\hspace{0.1cm}X_k^{cod}\hspace{0.1cm}\) defined as :
\[x_{ik}^{\hspace{0.07cm}cod} = \left\lbrace\begin{array}{l} 0 \hspace{0.2cm} , \hspace{0.2cm} \text{ if} \hspace{0.3cm} x_{ik} = g_1 \\ 1 \hspace{0.2cm} , \hspace{0.2cm} \text{ if} \hspace{0.3cm} x_{ik} = g_2 \\ ... \\ h-1 \hspace{0.2cm} , \hspace{0.2cm} \text{ if} \hspace{0.3cm} x_{ik} = g_h \end{array}\right. \\ \]
Properties :
- \(\hspace{0.1cm}R( X_k^{cod}) = \lbrace 0,1,..., h-1 \rbrace\)
2.10.1 Standard Recoding of Categorical Variables in Python
```python
from sklearn.preprocessing import OrdinalEncoder

def Standard_recoding(Data, Variable_name):
    Data[Variable_name + '_recode'] = OrdinalEncoder().fit_transform(Data[[Variable_name]])

Standard_recoding(Data=Netflix_Data, Variable_name='type')
Netflix_Data.loc[:, ['type', 'type_recode']].head()
```

| | type | type_recode |
|---|---|---|
| 0 | SHOW | 1.0 |
| 1 | MOVIE | 0.0 |
| 2 | MOVIE | 0.0 |
| 3 | MOVIE | 0.0 |
| 4 | MOVIE | 0.0 |
```python
Standard_recoding(Data=Netflix_Data, Variable_name='age_certification')

df1 = pd.DataFrame()
for i in range(0, 11):
    df2 = Netflix_Data.loc[Netflix_Data['age_certification_recode'] == i, ['age_certification', 'age_certification_recode']].head(1)
    df1 = pd.concat([df1, df2], axis=0)
df1
```

| | age_certification | age_certification_recode |
|---|---|---|
| 162 | G | 0.0 |
| 198 | NC-17 | 1.0 |
| 3 | PG | 2.0 |
| 11 | PG-13 | 3.0 |
| 1 | R | 4.0 |
| 5 | TV-14 | 5.0 |
| 46 | TV-G | 6.0 |
| 0 | TV-MA | 7.0 |
| 35 | TV-PG | 8.0 |
| 45 | TV-Y | 9.0 |
| 95 | TV-Y7 | 10.0 |
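Note that OrdinalEncoder assigns codes following the sorted (alphabetical) order of the category labels, which is why G gets code 0 and TV-Y7 gets code 10 above. The same recoding can be sketched with plain pandas (toy data, not the Netflix file):

```python
import pandas as pd

s = pd.Series(['SHOW', 'MOVIE', 'MOVIE', 'SHOW'])

# sort=True assigns codes in alphabetical order of the labels,
# matching OrdinalEncoder's default behaviour
codes, categories = pd.factorize(s, sort=True)

print(list(categories))  # ['MOVIE', 'SHOW']
print(list(codes))       # [1, 0, 0, 1]
```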
2.11 Categorization of Quantitative Variables
Given a quantitative statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
\(\hspace{0.25cm}\) Categorization of \(\hspace{0.1cm}X_k\hspace{0.1cm}\) consists of obtaining a new sample \(\hspace{0.1cm}X_k^{cat}\hspace{0.1cm}\) defined as: \(\\[0.3cm]\)
\[x_{ik}^{cat} = \left\lbrace\begin{array}{l} 0 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in [L_0 , L_1) \\ 1 \hspace{0.3cm} , \hspace{0.3cm} \text{if} \hspace{0.3cm} x_{ik} \in [L_1 , L_2) \\ ... \\ h-1 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in [L_{h-1} , L_h) \end{array}\right. \\[1.3cm] \]
\(\hspace{0.25cm}\) Another way of expressing it is: \(\\[0.3cm]\)
\[x_{ik}^{cat} = r \hspace{0.3cm} \Leftrightarrow \hspace{0.3cm} x_{ik} \in [L_r , L_{r+1}) \\\]
\(\hspace{0.25cm}\) where:
\(\hspace{0.25cm}\) \([L_0 , L_1) \hspace{0.03cm} , \hspace{0.03cm} [L_1 , L_2) \hspace{0.03cm}, \dots ,\hspace{0.03cm} [L_{h-1} , L_h] \hspace{0.12cm}\) are called the categorization intervals of \(\hspace{0.07cm}X_k\hspace{0.07cm}\) , and are intervals with the following properties:
- They are pairwise disjoint, that is, no two of them share elements. \(\\[0.3cm]\)
- Every observation in the sample \(\hspace{0.1cm}X_k\hspace{0.1cm}\) belongs to some interval.
\(\hspace{0.25cm}\) As a consequence:
- Each element of \(\hspace{0.1cm}X_k\hspace{0.1cm}\) belongs to a single interval. \(\\[0.15cm]\)
How do we define the categorization intervals?
There are different procedures for defining the categorization intervals. Some of the most common are quantile-based rules.
Next, we present some quantile-based procedures together with an alternative, Scott’s rule.
2.11.1 Mean Rule
Following the mean rule, the categorization intervals of a quantitative variable \(\hspace{0.1cm} X_k\hspace{0.1cm}\) would be the following: \(\\[0.2cm]\)
\[ \left(\hspace{0.1cm} Min(X_k) \hspace{0.07cm}-\hspace{0.07cm} c \hspace{0.1cm} ,\hspace{0.1cm} \overline{X}_k \hspace{0.1cm}\right] \hspace{0.1cm},\hspace{0.1cm} \left(\hspace{0.1cm} \overline{X}_k \hspace{0.1cm},\hspace{0.1cm} Max(X_k) \hspace{0.1cm}\right] \\[0.6cm] \]
\(\hspace{0.25cm}\) where \(\hspace{0.05cm}c > 0\hspace{0.05cm}\) is a small constant that ensures \(\hspace{0.05cm}Min(X_k)\hspace{0.05cm}\) falls inside the first interval. \(\\[0.3cm]\)
\(\hspace{0.25cm}\) With the mean rule, the categorical version of \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is defined as: \(\\[0.3cm]\)
\[x_{ik}^{\hspace{0.07cm}cat} \hspace{0.07cm}= \hspace{0.07cm} \left\lbrace\begin{array}{l} 0 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in \left(\hspace{0.1cm}Min(X_k) \hspace{0.07cm}-\hspace{0.07cm} c\hspace{0.1cm},\hspace{0.1cm} \overline{X}_k \hspace{0.1cm}\right] \\ 1 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in \left(\hspace{0.1cm}\overline{X}_k \hspace{0.1cm},\hspace{0.1cm} Max(X_k)\hspace{0.1cm}\right] \end{array}\right. \\[1cm] \]
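A minimal sketch of the mean rule on a toy sample (0 for values at or below the mean, 1 above it):

```python
import numpy as np

X = np.array([1.0, 2.0, 6.0, 9.0])     # toy sample, mean = 4.5
X_cat = (X > X.mean()).astype(int)     # mean-rule categorization

print(list(X_cat))  # [0, 0, 1, 1]
```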
2.11.2 Median Rule
Following the median rule, the categorization intervals of a quantitative variable \(\hspace{0.1cm} X_k\hspace{0.1cm}\) would be the following: \(\\[0.2cm]\)
\[ \Bigl(\hspace{0.1cm} Min(X_k) \hspace{0.07cm}-\hspace{0.07cm} c \hspace{0.1cm} ,\hspace{0.1cm} Me({X}_k) \hspace{0.1cm}\Bigr] \hspace{0.1cm},\hspace{0.1cm} \Bigl(\hspace{0.1cm} Me({X}_k) \hspace{0.1cm},\hspace{0.1cm} Max(X_k) \hspace{0.1cm}\Bigr] \\[0.6cm] \]
\(\hspace{0.25cm}\) With the median rule, the categorical version of \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is defined as: \(\\[0.3cm]\)
\[x_{ik}^{\hspace{0.07cm}cat} \hspace{0.07cm}= \hspace{0.07cm} \left\lbrace\begin{array}{l} 0 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in \Bigl(\hspace{0.1cm}Min(X_k) \hspace{0.07cm}-\hspace{0.07cm} c\hspace{0.1cm},\hspace{0.1cm} Me({X}_k) \hspace{0.1cm}\Bigr] \\ 1 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in \Bigl(\hspace{0.1cm} Me({X}_k) \hspace{0.1cm},\hspace{0.1cm} Max(X_k)\hspace{0.1cm}\Bigr] \end{array}\right. \\[1cm] \]
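The median rule can be sketched in the same way, replacing the mean by the median:

```python
import numpy as np

X = np.array([1.0, 2.0, 6.0, 9.0])         # toy sample, median = 4.0
X_cat = (X > np.median(X)).astype(int)     # median-rule categorization

print(list(X_cat))  # [0, 0, 1, 1]
```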
2.11.3 Quartile’s Rule
Following the quartile rule, the categorization intervals of a quantitative variable \(\hspace{0.1cm} X_k\hspace{0.1cm}\) would be the following: \(\\[0.2cm]\)
\[ \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.07cm}-\hspace{0.07cm} c \hspace{0.1cm} ,\hspace{0.1cm} Q(\hspace{0.03cm}0.25 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \hspace{0.1cm},\hspace{0.1cm} \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0.25 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm},\hspace{0.1cm} Q(\hspace{0.03cm}0.50 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \hspace{0.1cm},\hspace{0.1cm} \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0.50 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm},\hspace{0.1cm} Q(\hspace{0.03cm}0.75 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \hspace{0.1cm},\hspace{0.1cm} \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0.75 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm},\hspace{0.1cm} Max(X_k) \hspace{0.1cm}\Bigr] \\[0.6cm] \]
\(\hspace{0.25cm}\) With the quartile rule, the categorical version of \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is defined as: \(\\[0.3cm]\)
\[x_{ik}^{\hspace{0.07cm}cat} \hspace{0.07cm}= \hspace{0.07cm} \left\lbrace\begin{array}{l} 0 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.07cm}-\hspace{0.07cm} c \hspace{0.1cm} ,\hspace{0.1cm} Q(\hspace{0.03cm}0.25 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \\[0.15cm] 1 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0.25 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm},\hspace{0.1cm} Q(\hspace{0.03cm}0.50 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \\[0.15cm] 2 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0.50 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm},\hspace{0.1cm} Q(\hspace{0.03cm}0.75 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \\[0.15cm] 3 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0.75 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm},\hspace{0.1cm} Max(X_k) \hspace{0.1cm}\Bigr] \end{array}\right. \\[0.5cm] \]
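The quartile rule can be sketched with pandas' qcut, which builds quantile-based intervals and returns the interval code of each observation (a hedged illustration; the article's own implementation may differ):

```python
import pandas as pd

X = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
X_cat = pd.qcut(X, q=4, labels=False)   # codes 0..3, one per quartile interval

print(list(X_cat))  # [0, 0, 1, 1, 2, 2, 3, 3]
```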
2.11.4 Decile’s Rule
Following the decile rule, the categorization intervals of a quantitative variable \(\hspace{0.1cm} X_k\hspace{0.1cm}\) would be the following: \(\\[0.2cm]\)
\[ \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.07cm}-\hspace{0.07cm} c \hspace{0.1cm} ,\hspace{0.1cm} Q(\hspace{0.03cm}0.1 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \hspace{0.1cm},\hspace{0.1cm} \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0.1 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm},\hspace{0.1cm} Q(\hspace{0.03cm}0.2 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \hspace{0.1cm},\hspace{0.1cm} \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0.2 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm},\hspace{0.1cm} Q(\hspace{0.03cm}0.3 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \hspace{0.1cm}, \dots ,\hspace{0.1cm} \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0.9 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm},\hspace{0.1cm} Q(\hspace{0.03cm}1 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \\[0.6cm] \]
\(\hspace{0.25cm}\) With the decile rule, the categorical version of \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is defined as: \(\\[0.3cm]\)
\[x_{ik}^{\hspace{0.07cm}cat} \hspace{0.07cm}= \hspace{0.07cm} \left\lbrace\begin{array}{l} 0 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.07cm}-\hspace{0.07cm} c \hspace{0.1cm} ,\hspace{0.1cm} Q(\hspace{0.03cm}0.1 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \\[0.15cm] 1 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0.1 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm},\hspace{0.1cm} Q(\hspace{0.03cm}0.2 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \\[0.15cm] 2 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0.2 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm},\hspace{0.1cm} Q(\hspace{0.03cm}0.3 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \\ \dots \\ 9 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \in \Bigl(\hspace{0.1cm} Q(\hspace{0.03cm}0.9 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm},\hspace{0.1cm} Q(\hspace{0.03cm} 1 \hspace{0.03cm},\hspace{0.03cm} {X}_k \hspace{0.03cm}) \hspace{0.1cm}\Bigr] \end{array}\right. \\[0.5cm] \]
2.11.5 Quantile’s Rule
This rule is a generalization of the previous two.
Following the quantile rule, the series of categorization intervals of \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is: \(\\[0.25cm]\)
\[\begin{align} & \biggl\{ \hspace{0.1cm} \Bigl( \hspace{0.07cm} Q( \hspace{0.07cm} q_{\hspace{0.07cm}i} \hspace{0.07cm} , X_k \hspace{0.07cm} ) \hspace{0.07cm} - \hspace{0.07cm} c_i \hspace{0.12cm}, \hspace{0.12cm} Q( \hspace{0.07cm}q_{\hspace{0.07cm}i+1} \hspace{0.07cm}, \hspace{0.07cm} X_k \hspace{0.07cm}) \hspace{0.07cm} \Bigr] \hspace{0.2cm} : \hspace{0.2cm} i \in \lbrace 1,2,...,h \rbrace \hspace{0.1cm} \biggr\} \hspace{0.2cm} = \\[0.5cm] = \hspace{0.2cm} &\biggl\{ \hspace{0.1cm} \Bigl( \hspace{0.07cm} Q( \hspace{0.07cm} q_1 \hspace{0.07cm} , X_k \hspace{0.07cm} ) \hspace{0.07cm}- \hspace{0.07cm} c \hspace{0.12cm}, \hspace{0.12cm} Q( \hspace{0.07cm}q_{2} \hspace{0.07cm}, \hspace{0.07cm} X_k \hspace{0.07cm}) \hspace{0.07cm} \Bigr] \hspace{0.1cm} , \hspace{0.1cm} \Bigl( \hspace{0.07cm} Q( \hspace{0.07cm} q_2 \hspace{0.07cm} , X_k \hspace{0.07cm} ) \hspace{0.12cm}, \hspace{0.12cm} Q( \hspace{0.07cm}q_{3} \hspace{0.07cm}, \hspace{0.07cm} X_k \hspace{0.07cm}) \hspace{0.07cm} \Bigr] \hspace{0.1cm} , \hspace{0.1cm} \dots \hspace{0.1cm} , \hspace{0.1cm} \Bigl( \hspace{0.07cm} Q( \hspace{0.07cm} q_h \hspace{0.07cm} , X_k \hspace{0.07cm} ) \hspace{0.12cm}, \hspace{0.12cm} Q( \hspace{0.07cm}q_{h+1} \hspace{0.07cm}, \hspace{0.07cm} X_k \hspace{0.07cm}) \hspace{0.07cm} \Bigr] \hspace{0.1cm} \biggr\} \end{align}\] \(\\[0.25cm]\)
where:
- \(t \in (0 , 1)\hspace{0.07cm}\) is a parameter chosen by the analyst. \(\\[0.5cm]\)
- \(q_{\hspace{0.07cm}i}\hspace{0.07cm}\) is defined as follows:
\[q_{\hspace{0.07cm}i} \hspace{0.07cm} = \hspace{0.07cm} \left\lbrace\begin{array}{l} 0 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.4cm} i=1 \\[0.15cm] 1 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.4cm} i\hspace{0.07cm}=\hspace{0.07cm}h+1 \\[0.15cm] q_{\hspace{0.07cm} i-1} \hspace{0.07cm}+\hspace{0.07cm} t \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.4cm} i \in \lbrace 2,3,...,h \rbrace \end{array}\right. \] \(\\[0.5cm]\)
- \(h\hspace{0.07cm}\) is defined as follows:
\[h \hspace{0.07cm} = \hspace{0.07cm} \left\lceil \dfrac{1}{t}\right\rceil\] \(\\[0.5cm]\)
- \(c_{\hspace{0.07cm}i} \hspace{0.07cm}\) is defined as follows:
\[c_{\hspace{0.07cm}i} \hspace{0.07cm} = \hspace{0.07cm} \left\lbrace\begin{array}{l} 0 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.4cm} i\neq 1 \\[0.15cm] c > 0 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.4cm} i\hspace{0.07cm}=\hspace{0.07cm}1 \end{array}\right. \] \(\\[0.5cm]\)
- \(\lbrace \hspace{0.07cm}q_ i \hspace{0.07cm}:\hspace{0.07cm} i=1,...,h \hspace{0.07cm}\rbrace \hspace{0.07cm} = \hspace{0.07cm} \lbrace \hspace{0.07cm}q_1\hspace{0.07cm},\hspace{0.07cm} q_2\hspace{0.07cm}, \dots ,\hspace{0.07cm} q_h \hspace{0.07cm}\rbrace \hspace{0.2cm}\) is an ordered subset of \(\hspace{0.07cm}[0,1]\hspace{0.07cm}\) that starts at zero and whose elements are equispaced with step \(\hspace{0.07cm}t\hspace{0.07cm}\). That is, \(\hspace{0.12cm} q_1 = 0 \hspace{0.07cm}\leq\hspace{0.07cm} q_2 \hspace{0.07cm}\leq \dots \leq\hspace{0.07cm} q_h \hspace{0.07cm}\leq\hspace{0.07cm} 1\hspace{0.12cm}\), and \(\hspace{0.12cm}q_i - q_{i-1} = t\hspace{0.12cm}\) for \(\hspace{0.12cm}i = 2,...,h\).
For example, if \(\hspace{0.12cm}t=0.15\hspace{0.12cm}\), then \(\lbrace\hspace{0.1cm} q_ i \hspace{0.1cm}:\hspace{0.1cm} i=1,...,h \hspace{0.07cm}\rbrace \hspace{0.07cm}=\hspace{0.07cm} \lbrace\hspace{0.07cm} 0\hspace{0.07cm},\hspace{0.07cm} 0.15\hspace{0.07cm},\hspace{0.07cm} 0.3\hspace{0.07cm},\hspace{0.07cm} 0.45\hspace{0.07cm},\hspace{0.07cm} 0.6\hspace{0.07cm},\hspace{0.07cm} 0.75\hspace{0.07cm},\hspace{0.07cm} 0.9 \hspace{0.07cm}\rbrace\) \(\\[0.4cm]\)
To better understand this rule, we derive the categorization intervals manually for two example values of \(\hspace{0.07cm}t\hspace{0.07cm}\). \(\\[0.35cm]\)
Example one: \(\hspace{0.2cm}t=0.15\) \(\\[0.45cm]\)
\(h = \left\lceil \dfrac{1}{0.15}\right\rceil = 7\) \(\\[1cm]\)
\(q_1 = 0\) \(\\[0.45cm]\)
\(q_2 = q_1 + t = 0 + 0.15 = 0.15\) \(\\[0.45cm]\)
\(q_3 = q_2 + t = 0.15 + 0.15 = 0.3\) \(\\[0.45cm]\)
\(q_4 = q_3 + t = 0.3 + 0.15 = 0.45\) \(\\[0.45cm]\)
\(q_5 = q_4 + t = 0.45 + 0.15 = 0.6\) \(\\[0.45cm]\)
\(q_6 = q_5 + t = 0.6 + 0.15 = 0.75\) \(\\[0.45cm]\)
\(q_h = q_7 = q_6 + t = 0.75 + 0.15 = 0.9\) \(\\[0.45cm]\)
\(q_{h+1} = q_8 = 1\) \(\\[1cm]\)
Therefore, the categorization intervals would be the following:
\[\Bigl( \hspace{0.07cm} Q( \hspace{0.07cm} 0 \hspace{0.07cm} , X_k \hspace{0.07cm} ) \hspace{0.07cm}- \hspace{0.07cm} c \hspace{0.12cm}, \hspace{0.12cm} Q( \hspace{0.07cm} 0.15 \hspace{0.07cm}, \hspace{0.07cm} X_k \hspace{0.07cm}) \hspace{0.07cm} \Bigr] \hspace{0.1cm} , \hspace{0.1cm} \Bigl( \hspace{0.07cm} Q( \hspace{0.07cm} 0.15 \hspace{0.07cm} , X_k \hspace{0.07cm} ) \hspace{0.12cm}, \hspace{0.12cm} Q( \hspace{0.07cm}0.3 \hspace{0.07cm}, \hspace{0.07cm} X_k \hspace{0.07cm}) \hspace{0.07cm} \Bigr] \hspace{0.1cm} , \hspace{0.1cm} \Bigl( \hspace{0.07cm} Q( \hspace{0.07cm} 0.3 \hspace{0.07cm} , X_k \hspace{0.07cm} ) \hspace{0.12cm}, \hspace{0.12cm} Q( \hspace{0.07cm}0.45 \hspace{0.07cm}, \hspace{0.07cm} X_k \hspace{0.07cm}) \hspace{0.07cm} \Bigr] \hspace{0.07cm} , \hspace{0.07cm} \dots \hspace{0.07cm} , \hspace{0.07cm} \Bigl( \hspace{0.07cm} Q( \hspace{0.07cm} 0.9 \hspace{0.07cm} , X_k \hspace{0.07cm} ) \hspace{0.12cm}, \hspace{0.12cm} Q( \hspace{0.07cm}1 \hspace{0.07cm}, \hspace{0.07cm} X_k \hspace{0.07cm}) \hspace{0.07cm} \Bigr] \]
Example two: \(\hspace{0.1cm}t=0.25\)
We will see that in this case we obtain the same categorization intervals as with the quartile rule.
\(h = \left\lceil \dfrac{1}{0.25}\right\rceil = 4\) \(\\[1cm]\)
\(q_1 = 0\) \(\\[0.45cm]\)
\(q_2 = q_1 + t = 0 + 0.25 = 0.25\) \(\\[0.45cm]\)
\(q_3 = q_2 + t = 0.25 + 0.25 = 0.5\) \(\\[0.45cm]\)
\(q_h = q_4 = q_3 + t = 0.5 + 0.25 = 0.75\) \(\\[0.45cm]\)
\(q_{h+1} = q_5 = 1\) \(\\[0.45cm]\)
Therefore, the categorization intervals would be the following:
\[\Bigl( \hspace{0.07cm} Q( \hspace{0.07cm} 0 \hspace{0.07cm} , X_k \hspace{0.07cm} ) \hspace{0.07cm}- \hspace{0.07cm} c \hspace{0.12cm}, \hspace{0.12cm} Q( \hspace{0.07cm} 0.25 \hspace{0.07cm}, \hspace{0.07cm} X_k \hspace{0.07cm}) \hspace{0.07cm} \Bigr] \hspace{0.1cm} , \hspace{0.1cm} \Bigl( \hspace{0.07cm} Q( \hspace{0.07cm} 0.25 \hspace{0.07cm} , X_k \hspace{0.07cm} ) \hspace{0.12cm}, \hspace{0.12cm} Q( \hspace{0.07cm}0.5 \hspace{0.07cm}, \hspace{0.07cm} X_k \hspace{0.07cm}) \hspace{0.07cm} \Bigr] \hspace{0.1cm} , \hspace{0.1cm} \Bigl( \hspace{0.07cm} Q( \hspace{0.07cm} 0.5 \hspace{0.07cm} , X_k \hspace{0.07cm} ) \hspace{0.12cm}, \hspace{0.12cm} Q( \hspace{0.07cm}0.75 \hspace{0.07cm}, \hspace{0.07cm} X_k \hspace{0.07cm}) \hspace{0.07cm} \Bigr] \hspace{0.07cm} , \hspace{0.07cm} \Bigl( \hspace{0.07cm} Q( \hspace{0.07cm} 0.75 \hspace{0.07cm} , X_k \hspace{0.07cm} ) \hspace{0.12cm}, \hspace{0.12cm} Q( \hspace{0.07cm}1 \hspace{0.07cm}, \hspace{0.07cm} X_k \hspace{0.07cm}) \hspace{0.07cm} \Bigr] \] \(\\[0.5cm]\)
It is also easy to see that if we had used \(t=0.1\) we would have obtained the same categorization intervals as with the decile rule.
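The two derivations above can be reproduced with a short Python sketch. The helper `quantile_grid` is an illustrative name introduced here, not part of the article's code:

```python
import math

def quantile_grid(t):
    """Return the ordered grid q_1 = 0, q_i = q_{i-1} + t, ..., q_{h+1} = 1."""
    h = math.ceil(1 / t)                          # h = ceil(1/t)
    grid = [round(i * t, 10) for i in range(h)]   # q_1, ..., q_h (rounded to avoid float noise)
    grid.append(1.0)                              # q_{h+1} = 1
    return grid

print(quantile_grid(0.15))  # [0.0, 0.15, 0.3, 0.45, 0.6, 0.75, 0.9, 1.0]
print(quantile_grid(0.25))  # [0.0, 0.25, 0.5, 0.75, 1.0] -> the quartile rule
```

With `t=0.25` the grid reproduces the quartile probabilities, and with `t=0.1` it would reproduce the decile probabilities.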
2.11.6 Categorization of Quantitative Variables in Python
2.11.6.1 Using pandas’ default rule
pd.cut(x=Netflix_Data['release_year'], bins=5)
0       (1944.923, 1960.4]
1 (1975.8, 1991.2]
2 (1960.4, 1975.8]
3 (1960.4, 1975.8]
4 (1960.4, 1975.8]
...
5845 (2006.6, 2022.0]
5846 (2006.6, 2022.0]
5847 (2006.6, 2022.0]
5848 (2006.6, 2022.0]
5849 (2006.6, 2022.0]
Name: release_year, Length: 5850, dtype: category
Categories (5, interval[float64, right]): [(1944.923, 1960.4] < (1960.4, 1975.8] < (1975.8, 1991.2] < (1991.2, 2006.6] < (2006.6, 2022.0]]
pd.cut(x=Netflix_Data['release_year'], bins=5, labels=False)
0       0
1 2
2 1
3 1
4 1
..
5845 4
5846 4
5847 4
5848 4
5849 4
Name: release_year, Length: 5850, dtype: int64
2.11.6.2 Using the median rule in Python
intervals = [Netflix_Data['release_year'].min() - 0.5 , Netflix_Data['release_year'].median(), Netflix_Data['release_year'].max() ]
intervals
[1944.5, 2018.0, 2022.0]
pd.cut(x=Netflix_Data['release_year'], bins=intervals)
0       (1944.5, 2018.0]
1 (1944.5, 2018.0]
2 (1944.5, 2018.0]
3 (1944.5, 2018.0]
4 (1944.5, 2018.0]
...
5845 (2018.0, 2022.0]
5846 (2018.0, 2022.0]
5847 (2018.0, 2022.0]
5848 (2018.0, 2022.0]
5849 (2018.0, 2022.0]
Name: release_year, Length: 5850, dtype: category
Categories (2, interval[float64, right]): [(1944.5, 2018.0] < (2018.0, 2022.0]]
pd.cut(x=Netflix_Data['release_year'], bins=intervals, labels=False)
0       0
1 0
2 0
3 0
4 0
..
5845 1
5846 1
5847 1
5848 1
5849 1
Name: release_year, Length: 5850, dtype: int64
2.11.6.3 Using the quartile rule
intervals = [Netflix_Data['release_year'].min() - 1 , Netflix_Data['release_year'].quantile(0.25), Netflix_Data['release_year'].quantile(0.5), Netflix_Data['release_year'].quantile(0.75), Netflix_Data['release_year'].max()]
intervals
[1944.0, 2016.0, 2018.0, 2020.0, 2022.0]
Netflix_Data['release_year_cat_int_1'] = pd.cut(x=Netflix_Data['release_year'], bins=intervals)
Netflix_Data['release_year_cat_1'] = pd.cut(x=Netflix_Data['release_year'], bins=intervals, labels=False)
Netflix_Data.loc[: , ['release_year','release_year_cat_int_1','release_year_cat_1']]
| release_year | release_year_cat_int_1 | release_year_cat_1 | |
|---|---|---|---|
| 0 | 1945.0 | (1944.0, 2016.0] | 0 |
| 1 | 1976.0 | (1944.0, 2016.0] | 0 |
| 2 | 1972.0 | (1944.0, 2016.0] | 0 |
| 3 | 1975.0 | (1944.0, 2016.0] | 0 |
| 4 | 1967.0 | (1944.0, 2016.0] | 0 |
| … | … | … | … |
| 5845 | 2021.0 | (2020.0, 2022.0] | 3 |
| 5846 | 2021.0 | (2020.0, 2022.0] | 3 |
| 5847 | 2021.0 | (2020.0, 2022.0] | 3 |
| 5848 | 2021.0 | (2020.0, 2022.0] | 3 |
| 5849 | 2021.0 | (2020.0, 2022.0] | 3 |
5850 rows × 3 columns
2.11.6.4 Using the decile rule
import numpy as np
intervals = []
for q in np.arange(0, 1, step=0.1):
    intervals.append(Netflix_Data['imdb_score'].quantile(q))
np.arange(0, 1, step=0.1)
array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
intervals
[1.5, 5.0, 5.6, 6.0, 6.3, 6.6, 6.9, 7.2, 7.5, 7.9]
intervals[0] = intervals[0] - 0.5   # the first endpoint must be Q(0) - c
intervals.append(Netflix_Data['imdb_score'].quantile(1))   # the last endpoint must be Q(1)
intervals
[1.0, 5.0, 5.6, 6.0, 6.3, 6.6, 6.9, 7.2, 7.5, 7.9, 9.6]
Netflix_Data['imdb_score_cat_int_1'] = pd.cut(x=Netflix_Data['imdb_score'], bins=intervals)
Netflix_Data['imdb_score_cat_1'] = pd.cut(x=Netflix_Data['imdb_score'], bins=intervals, labels=False)
Netflix_Data.loc[: , ['imdb_score','imdb_score_cat_int_1','imdb_score_cat_1']]
| imdb_score | imdb_score_cat_int_1 | imdb_score_cat_1 | |
|---|---|---|---|
| 0 | NaN | NaN | NaN |
| 1 | 8.2 | (7.9, 9.6] | 9.0 |
| 2 | 7.7 | (7.5, 7.9] | 8.0 |
| 3 | 8.2 | (7.9, 9.6] | 9.0 |
| 4 | 7.7 | (7.5, 7.9] | 8.0 |
| … | … | … | … |
| 5845 | 6.8 | (6.6, 6.9] | 5.0 |
| 5846 | 7.7 | (7.5, 7.9] | 8.0 |
| 5847 | 3.8 | (1.0, 5.0] | 0.0 |
| 5848 | NaN | NaN | NaN |
| 5849 | 7.8 | (7.5, 7.9] | 8.0 |
5850 rows × 3 columns
df1 = pd.DataFrame()
for i in range(0, len(Netflix_Data['imdb_score_cat_1'].dropna().unique())):  # dropna: NaN is not a category
    df2 = Netflix_Data.loc[Netflix_Data.imdb_score_cat_1 == i, ['imdb_score_cat_int_1','imdb_score_cat_1']].head(1)
    df1 = pd.concat([df1, df2], axis=0)
df1
| imdb_score_cat_int_1 | imdb_score_cat_1 | |
|---|---|---|
| 19 | (1.0, 5.0] | 0.0 |
| 29 | (5.0, 5.6] | 1.0 |
| 9 | (5.6, 6.0] | 2.0 |
| 16 | (6.0, 6.3] | 3.0 |
| 25 | (6.3, 6.6] | 4.0 |
| 23 | (6.6, 6.9] | 5.0 |
| 17 | (6.9, 7.2] | 6.0 |
| 10 | (7.2, 7.5] | 7.0 |
| 2 | (7.5, 7.9] | 8.0 |
| 1 | (7.9, 9.6] | 9.0 |
2.11.7 Using the quantile rule
def Categorization_quantiles_rule(Data, Variable_name, t):
    intervals = []
    for q in np.arange(0, 1, step=t):
        intervals.append(Data[Variable_name].quantile(q))
    intervals[0] = intervals[0] - 0.5                       # first endpoint: Q(0) - c
    intervals.append(Data[Variable_name].quantile(1))       # last endpoint: Q(1)
    Data[Variable_name + '_cat_interval'] = pd.cut(x=Data[Variable_name], bins=intervals)
    Data[Variable_name + '_cat'] = pd.cut(x=Data[Variable_name], bins=intervals, labels=False)
Categorization_quantiles_rule(Data=Netflix_Data, Variable_name='imdb_score', t=0.05)
Netflix_Data.loc[: , ['imdb_score','imdb_score_cat_interval','imdb_score_cat']]
| imdb_score | imdb_score_cat_interval | imdb_score_cat | |
|---|---|---|---|
| 0 | NaN | NaN | NaN |
| 1 | 8.2 | (7.9, 8.2] | 18.0 |
| 2 | 7.7 | (7.5, 7.7] | 16.0 |
| 3 | 8.2 | (7.9, 8.2] | 18.0 |
| 4 | 7.7 | (7.5, 7.7] | 16.0 |
| … | … | … | … |
| 5845 | 6.8 | (6.6, 6.8] | 10.0 |
| 5846 | 7.7 | (7.5, 7.7] | 16.0 |
| 5847 | 3.8 | (1.0, 4.4] | 0.0 |
| 5848 | NaN | NaN | NaN |
| 5849 | 7.8 | (7.7, 7.9] | 17.0 |
5850 rows × 3 columns
df1 = pd.DataFrame()
for i in range(0, len(Netflix_Data['imdb_score_cat'].dropna().unique())):  # dropna: NaN is not a category
    df2 = Netflix_Data.loc[Netflix_Data.imdb_score_cat == i, ['imdb_score_cat_interval','imdb_score_cat']].head(1)
    df1 = pd.concat([df1, df2], axis=0)
df1
| imdb_score_cat_interval | imdb_score_cat | |
|---|---|---|
| 19 | (1.0, 4.4] | 0.0 |
| 33 | (4.4, 5.0] | 1.0 |
| 29 | (5.0, 5.3] | 2.0 |
| 74 | (5.3, 5.6] | 3.0 |
| 9 | (5.6, 5.8] | 4.0 |
| 69 | (5.8, 6.0] | 5.0 |
| 16 | (6.0, 6.2] | 6.0 |
| 79 | (6.2, 6.3] | 7.0 |
| 25 | (6.3, 6.5] | 8.0 |
| 31 | (6.5, 6.6] | 9.0 |
| 23 | (6.6, 6.8] | 10.0 |
| 44 | (6.8, 6.9] | 11.0 |
| 46 | (6.9, 7.1] | 12.0 |
| 17 | (7.1, 7.2] | 13.0 |
| 11 | (7.2, 7.3] | 14.0 |
| 10 | (7.3, 7.5] | 15.0 |
| 2 | (7.5, 7.7] | 16.0 |
| 32 | (7.7, 7.9] | 17.0 |
| 1 | (7.9, 8.2] | 18.0 |
| 5 | (8.2, 9.6] | 19.0 |
\(\\\)
Categorization_quantiles_rule(Data=Netflix_Data, Variable_name='imdb_score', t=0.25)
Netflix_Data.loc[: , ['imdb_score','imdb_score_cat_interval','imdb_score_cat']]
| imdb_score | imdb_score_cat_interval | imdb_score_cat | |
|---|---|---|---|
| 0 | NaN | NaN | NaN |
| 1 | 8.2 | (7.3, 9.6] | 3.0 |
| 2 | 7.7 | (7.3, 9.6] | 3.0 |
| 3 | 8.2 | (7.3, 9.6] | 3.0 |
| 4 | 7.7 | (7.3, 9.6] | 3.0 |
| … | … | … | … |
| 5845 | 6.8 | (6.6, 7.3] | 2.0 |
| 5846 | 7.7 | (7.3, 9.6] | 3.0 |
| 5847 | 3.8 | (1.0, 5.8] | 0.0 |
| 5848 | NaN | NaN | NaN |
| 5849 | 7.8 | (7.3, 9.6] | 3.0 |
5850 rows × 3 columns
df1 = pd.DataFrame()
for i in range(0, len(Netflix_Data['imdb_score_cat'].dropna().unique())):  # dropna: NaN is not a category
    df2 = Netflix_Data.loc[Netflix_Data.imdb_score_cat == i, ['imdb_score_cat_interval','imdb_score_cat']].head(1)
    df1 = pd.concat([df1, df2], axis=0)
df1
| imdb_score_cat_interval | imdb_score_cat | |
|---|---|---|
| 9 | (1.0, 5.8] | 0.0 |
| 16 | (5.8, 6.6] | 1.0 |
| 11 | (6.6, 7.3] | 2.0 |
| 1 | (7.3, 9.6] | 3.0 |
\(\\\)
Categorization_quantiles_rule(Data=Netflix_Data, Variable_name='imdb_score', t=0.1)
Netflix_Data.loc[: , ['imdb_score','imdb_score_cat_interval','imdb_score_cat']]
| imdb_score | imdb_score_cat_interval | imdb_score_cat | |
|---|---|---|---|
| 0 | NaN | NaN | NaN |
| 1 | 8.2 | (7.9, 9.6] | 9.0 |
| 2 | 7.7 | (7.5, 7.9] | 8.0 |
| 3 | 8.2 | (7.9, 9.6] | 9.0 |
| 4 | 7.7 | (7.5, 7.9] | 8.0 |
| … | … | … | … |
| 5845 | 6.8 | (6.6, 6.9] | 5.0 |
| 5846 | 7.7 | (7.5, 7.9] | 8.0 |
| 5847 | 3.8 | (1.0, 5.0] | 0.0 |
| 5848 | NaN | NaN | NaN |
| 5849 | 7.8 | (7.5, 7.9] | 8.0 |
5850 rows × 3 columns
df1 = pd.DataFrame()
for i in range(0, len(Netflix_Data['imdb_score_cat'].dropna().unique())):  # dropna: NaN is not a category
    df2 = Netflix_Data.loc[Netflix_Data.imdb_score_cat == i, ['imdb_score_cat_interval','imdb_score_cat']].head(1)
    df1 = pd.concat([df1, df2], axis=0)
df1
| imdb_score_cat_interval | imdb_score_cat | |
|---|---|---|
| 19 | (1.0, 5.0] | 0.0 |
| 29 | (5.0, 5.6] | 1.0 |
| 9 | (5.6, 6.0] | 2.0 |
| 16 | (6.0, 6.3] | 3.0 |
| 25 | (6.3, 6.6] | 4.0 |
| 23 | (6.6, 6.9] | 5.0 |
| 17 | (6.9, 7.2] | 6.0 |
| 10 | (7.2, 7.5] | 7.0 |
| 2 | (7.5, 7.9] | 8.0 |
| 1 | (7.9, 9.6] | 9.0 |
2.12 Dummification of Categorical Variables
Given a categorical statistical variable \(\hspace{0.06cm}\mathcal{X}_k\hspace{0.05cm}\) with range \(\hspace{0.06cm}R(\mathcal{X}_k)=\lbrace g_1 ,..., g_h \rbrace\hspace{0.06cm}\) , and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
Dummifying \(\hspace{0.06cm}X_k\hspace{0.06cm}\) consists of obtaining the new dummy samples \(\hspace{0.06cm}X_{k\hspace{0.06cm}0}\hspace{0.06cm},\hspace{0.06cm}X_{k\hspace{0.06cm}1}\hspace{0.06cm},...,\hspace{0.06cm}X_{k \hspace{0.06cm}h-1}\hspace{0.06cm}\), where \(\hspace{0.06cm}X_{kj}\hspace{0.06cm}\), the dummy associated with the category \(\hspace{0.06cm}g_{j+1}\hspace{0.06cm}\), is defined as:
\[x_{i,kj} = \left\lbrace\begin{array}{l} 1 \hspace{0.3cm} , \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} = g_{j+1} \\[0.1cm] 0 \hspace{0.25cm}, \hspace{0.3cm} \text{ if} \hspace{0.3cm} x_{ik} \neq g_{j+1} \end{array}\right.\]
for \(\hspace{0.06cm}j\in \lbrace 0\hspace{0.06cm},\hspace{0.06cm}1\hspace{0.06cm},\dots ,\hspace{0.06cm}h-1 \rbrace\)
2.12.1 Dummification of Categorical Variables in Python
def dummies(Data, Variable_name, drop_first=False):
df_dummies = pd.get_dummies(Data[Variable_name], drop_first=drop_first)
    return df_dummies
dummies(Data=Netflix_Data, Variable_name='type')
| MOVIE | SHOW | |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 1 | 0 |
| … | … | … |
| 5845 | 1 | 0 |
| 5846 | 1 | 0 |
| 5847 | 1 | 0 |
| 5848 | 1 | 0 |
| 5849 | 0 | 1 |
5850 rows × 2 columns
dummies(Data=Netflix_Data, Variable_name='age_certification')
| G | NC-17 | PG | PG-13 | R | TV-14 | TV-G | TV-MA | TV-PG | TV-Y | TV-Y7 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … |
| 5845 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5846 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5847 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5848 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5849 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5850 rows × 11 columns
dummies(Data=Netflix_Data, Variable_name='age_certification', drop_first=True)
| NC-17 | PG | PG-13 | R | TV-14 | TV-G | TV-MA | TV-PG | TV-Y | TV-Y7 | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| … | … | … | … | … | … | … | … | … | … | … |
| 5845 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5846 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5847 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5848 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5849 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5850 rows × 10 columns
3 Statistical Description
3.1 Statistical variable
A statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) can be modeled as a random variable.
Under this approach, we can apply all probability theory on random variables to statistical variables. \(\\[0.4cm]\)
3.2 Range of a statistical variable
The range of a statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) is denoted by \(\hspace{0.05cm}Range(\mathcal{X}_k)\hspace{0.05cm}\), and is defined as the set of possible values of \(\hspace{0.05cm}\mathcal{X}_k\). \(\\[0.4cm]\)
3.2.1 Statistical variable types: quantitative and categorical
The variable \(\mathcal{X}_k\) is quantitative if the elements of its range are conceptually numbers. \(\\[0.5cm]\)
The variable \(\mathcal{X}_k\) is categorical if the elements of its range are labels or categories (they can be numbers at a symbolic level but not at a conceptual level). \(\\[0.4cm]\)
3.2.2 Quantitative variable types: continuous and discrete
We can distinguish at least two types of quantitative variables: continuous and discrete.
\(\mathcal{X}_k\hspace{0.05cm}\) is continuous if \(\hspace{0.05cm}Range(\mathcal{X}_k)\hspace{0.05cm}\) is an uncountable set. \(\\[0.5cm]\)
\(\mathcal{X}_k\hspace{0.05cm}\) is discrete if \(\hspace{0.05cm}Range(\mathcal{X}_k)\hspace{0.05cm}\) is a countable set. \(\\[0.2cm]\)
Note:
In particular, variables whose range is a finite set are discrete.
An infinite range is not enough to make a variable continuous: a countably infinite range (for example, the natural numbers) still gives a discrete variable; only an uncountable range (for example, an interval of \(\mathbb{R}\)) gives a continuous one. \(\\[0.4cm]\)
3.2.3 Categorical variable types: r-ary
Let \(\mathcal{X}_k\) be a categorical variable.
- \(\mathcal{X}_k\) is \(r\)-ary if its range has \(r\) elements that are categories or labels.
In Statistics, binary (2-ary) categorical variables are particularly important. \(\\[0.4cm]\)
3.2.4 Categorical variable types: nominal and ordinal
Let \(\mathcal{X}_k\) be an \(r\)-ary categorical variable.
\(\mathcal{X}_k\) is nominal if there is no ordering between the \(r\) categories of its range. \(\\[0.4cm]\)
\(\mathcal{X}_k\) is ordinal if there is an ordering between the \(r\) categories of its range. \(\\[0.4cm]\)
3.3 Sample of a statistical variable
Given a statistical variable \(\hspace{0.05cm}\mathcal{X}_k\).
A sample of \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) is a vector of values of \(\hspace{0.05cm}\mathcal{X}_k\), called observations.
Therefore:
\[ X_k \hspace{0.05cm} = \hspace{0.05cm} \begin{pmatrix} x_{1k} \\ x_{2k}\\ ... \\ x_{nk} \end{pmatrix} \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t \\ \]
is a sample of the statistical variable \(\hspace{0.05cm} \mathcal{X}_k \hspace{0.05cm}\) because it is a vector with the values or observations of that variable for \(\hspace{0.05cm} n \hspace{0.05cm}\) elements or individuals of a sample.
Where: \(\hspace{0.1cm} x_{ik}\hspace{0.05cm}\) is the value of the \(\hspace{0.05cm} i\)-th observation of the variable \(\hspace{0.05cm} \mathcal{X}_k\). \(\\[0.4cm]\)
3.4 Arithmetic Mean
Given a quantitative statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
The arithmetic mean of \(\hspace{0.05cm}X_k \hspace{0.05cm}\) is defined as: \(\\[0.3cm]\)
\[\overline{\hspace{0.05cm} X_k \hspace{0.05cm} } \hspace{0.1cm}=\hspace{0.1cm} \dfrac{1}{n} \cdot \sum_{i=1}^n \hspace{0.05cm} x_{ik}\] \(\\[0.4cm]\)
Properties:
Existence: the arithmetic mean of a sample \(X_k\) of a statistical variable \(\mathcal{X}_k\) always exists, for any \(X_{k}\).
Commutativity: the arithmetic mean is not affected by the order of the elements of the sample \(X_k\).
Additivity: \(\overline{X_k} + \overline{X_j} = \overline{X_k + X_j}\)
Linearity: \(\overline{ a\cdot X_k + b} = a \cdot \overline{X_k} + b\) , for any \(a,b \in \mathbb{R}\)
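The definition and the last two properties can be checked with a small numpy sketch; the arrays below are made-up toy data, not the Netflix data set:

```python
import numpy as np

X_k = np.array([2.0, 4.0, 6.0, 8.0])
X_j = np.array([1.0, 1.0, 1.0, 1.0])

# Definition: (1/n) * sum of the observations
mean_Xk = X_k.sum() / len(X_k)
print(mean_Xk)   # 5.0

# Additivity: mean(X_k + X_j) == mean(X_k) + mean(X_j)
print(np.mean(X_k + X_j) == np.mean(X_k) + np.mean(X_j))   # True

# Linearity: mean(a*X_k + b) == a*mean(X_k) + b
a, b = 3.0, 2.0
print(np.mean(a * X_k + b) == a * np.mean(X_k) + b)        # True
```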
3.5 Weighted Mean
Given a quantitative statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
And given a vector of weights, one for each observation of the variable \(\hspace{0.05cm} \mathcal{X}_k \hspace{0.2cm} \Rightarrow \hspace{0.2cm}\) \(w \hspace{0.05cm} = \hspace{0.05cm} (w_1,w_2,...,w_n)^t\)
The weighted mean of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) with the weights vector \(\hspace{0.05cm} w \hspace{0.05cm}\) is defined as:
\[ \overline{X_k} (w) \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{\hspace{0.1cm}\sum_{i=1}^{n} \hspace{0.05cm} w_{i} \hspace{0.1cm}} \hspace{0.05cm}\cdot\hspace{0.05cm} \sum_{i=1}^{n} \hspace{0.1cm} x_{ik} \cdot w_i \] \(\\[0.4cm]\)
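A minimal sketch of the weighted mean with made-up toy data; numpy's `np.average` implements the same formula:

```python
import numpy as np

X_k = np.array([5.0, 7.0, 9.0])
w   = np.array([1.0, 2.0, 1.0])   # made-up weights

# Definition: sum(x_i * w_i) / sum(w_i)
weighted_mean = (X_k * w).sum() / w.sum()
print(weighted_mean)               # 7.0
print(np.average(X_k, weights=w))  # 7.0, numpy's built-in equivalent
```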
3.6 Geometric Mean
Given a variable \(\hspace{0.05cm} X_k=(x_{1k}, x_{2k},...,x_{nk})^t\hspace{0.05cm}\) with positive observations.
The geometric mean of the variable \(\hspace{0.05cm}X_k\hspace{0.05cm}\) is defined as: \(\\[0.3cm]\)
\[ \overline{X_k}_{geo} \hspace{0.05cm} = \hspace{0.05cm} \sqrt[n]{\prod_{i=1}^{n} x_{ik}} \hspace{0.05cm} = \hspace{0.05cm} \sqrt[n]{x_{1k}\cdot x_{2k}\cdot \ldots \cdot x_{nk}} \] \(\\[0.4cm]\)
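A sketch of the geometric mean on toy positive data; the log-based form is the numerically safer equivalent:

```python
import numpy as np

X_k = np.array([1.0, 4.0, 16.0])   # toy positive data

# n-th root of the product of the observations
geo_mean = np.prod(X_k) ** (1 / len(X_k))
print(geo_mean)   # approx. 4.0 (cube root of 64)

# Equivalent via logs: exp(mean(log(x))), less prone to overflow
print(np.exp(np.log(X_k).mean()))   # approx. 4.0
```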
3.7 Median
Given a statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
The median of \(\hspace{0.05cm}X_k \hspace{0.05cm}\) is defined as a value \(Me(X_k)\) such that: \(\\[0.3cm]\)
\[\dfrac{1}{n} \cdot \sum_{i=1}^n \hspace{0.1cm} \mathbb{I} \hspace{0.05cm} \bigl[ \hspace{0.1cm} x_{ik} \hspace{0.05cm} \leq \hspace{0.05cm} Me(X_k) \hspace{0.1cm} \bigr] \hspace{0.1cm} = \hspace{0.1cm} 0.50\]
where: \(\hspace{0.15cm}\mathbb{I}\hspace{0.1cm}\) is the indicator function. \(\\[0.4cm]\)
Properties:
Existence: the median always exists for any set of numbers.
Permutation invariance: the order of the numbers does not affect the median.
Non-linearity: the median of a sum of samples is not, in general, the sum of their medians.
Affine equivariance: for \(c > 0\), \(\hspace{0.07cm}median(c\cdot X_k + b) = c\cdot median(X_k) + b\hspace{0.07cm}\); in particular, multiplying all the numbers by a positive constant scales the median by that same constant.
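A numpy sketch with made-up toy data illustrates the median's robustness to an outlier, its behavior under an affine map with \(c > 0\), and its non-linearity:

```python
import numpy as np

X_k = np.array([1.0, 3.0, 5.0, 7.0, 100.0])   # toy data with an outlier

print(np.median(X_k))   # 5.0 -- barely affected by the outlier

# Affine equivariance for c > 0: median(c*X_k + b) == c*median(X_k) + b
c, b = 2.0, 1.0
print(np.median(c * X_k + b) == c * np.median(X_k) + b)   # True

# Non-linearity: median(X_k + X_j) need not equal median(X_k) + median(X_j)
X_j = np.array([0.0, 10.0, 0.0, 0.0, 0.0])
print(np.median(X_k + X_j), np.median(X_k) + np.median(X_j))   # 7.0 5.0
```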
3.8 Mode
Given a categorical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
The mode of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is the most repeated value in \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\), that is, its most frequent value. \(\\[0.4cm]\)
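A sketch of the mode on a made-up categorical sample, using pandas as elsewhere in the article:

```python
import pandas as pd

X_k = pd.Series(['drama', 'comedy', 'drama', 'action', 'drama'])   # toy sample

# The mode is the most frequent value of the sample
print(X_k.mode()[0])                 # drama
print(X_k.value_counts().idxmax())   # drama, an equivalent computation
```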
3.9 Variance
Given a quantitative variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
The variance of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is defined as:
\[\sigma(X_k)^2 \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{n} \cdot \sum_{i=1}^n \hspace{0.05cm} \left(\hspace{0.05cm} x_{ik} - \overline{X_k} \hspace{0.05cm}\right)^2\]
The standard deviation of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is defined as:
\[\sigma(X_k) \hspace{0.1cm} = \hspace{0.1cm} \sqrt{ \sigma(X_k)^2 } \hspace{0.1cm} = \hspace{0.1cm} \sqrt{ \dfrac{1}{n} \cdot \sum_{i=1}^n \left( \hspace{0.05cm} x_{ik} - \overline{X_k} \hspace{0.05cm} \right)^2 }\] \(\\[0.4cm]\)
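The \(1/n\) (population) definition above matches numpy's default; note that pandas' `var`/`std` use the \(1/(n-1)\) sample version (`ddof=1`) unless told otherwise. A sketch with toy data:

```python
import numpy as np
import pandas as pd

X_k = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population variance and standard deviation (numpy default ddof=0, i.e. 1/n)
print(np.var(X_k))   # 4.0
print(np.std(X_k))   # 2.0

# pandas defaults to ddof=1 (1/(n-1)); pass ddof=0 to match the definition above
print(pd.Series(X_k).var())        # approx. 4.571
print(pd.Series(X_k).var(ddof=0))  # 4.0
```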
3.10 Median Absolute Deviation
Given a statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
The median absolute deviation (MAD) of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is defined as:
\[MAD(X_k) \hspace{0.1cm} = \hspace{0.1cm} Me \bigl( \hspace{0.1cm} \left| \hspace{0.05cm} X_k - Me(X_k) \hspace{0.05cm} \right| \hspace{0.1cm} \bigr) \hspace{0.1cm} = \hspace{0.1cm} Me \hspace{0.1cm} \Bigr[ \hspace{0.1cm} \left( \hspace{0.2cm} \left| \hspace{0.1cm} x_{ik} - Me(X_k) \hspace{0.1cm} \right| \hspace{0.15cm} : \hspace{0.15cm} i = 1,\dots,n \hspace{0.2cm} \right) \hspace{0.1cm} \Bigr]\] \(\\[0.4cm]\)
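The MAD can be computed directly from its definition with numpy; the toy data below include an outlier to show its robustness:

```python
import numpy as np

X_k = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # toy data with an outlier

med = np.median(X_k)                  # 3.0
mad = np.median(np.abs(X_k - med))    # median of the absolute deviations |x_ik - Me(X_k)|
print(mad)   # 1.0 -- the outlier barely moves it
```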
3.11 Quantiles
Given a statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
The \(\hspace{0.05cm}q\)-order quantile of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is defined as a value \(Q(X_k , q)\) such that:
\[\dfrac{1}{n} \cdot \sum_{i=1}^n \hspace{0.1cm} \mathbb{I} \hspace{0.05cm} \bigl[ \hspace{0.1cm} x_{ik} \hspace{0.05cm} \leq \hspace{0.05cm} Q(\hspace{0.05cm} X_k \hspace{0.05cm},\hspace{0.05cm} q \hspace{0.05cm}) \hspace{0.1cm} \bigr] \hspace{0.1cm} = \hspace{0.1cm} q\]
where: \(\hspace{0.15cm}\mathbb{I}\hspace{0.1cm}\) is the indicator function. \(\\[0.3cm]\)
Observation:
The median is the 0.5-order quantile. \(\\[0.4cm]\)
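As a sketch, sample quantiles can be obtained with `np.quantile`; note that software implementations interpolate between order statistics, so they approximate the indicator-based definition above:

```python
import numpy as np

X_k = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])

q25, q50, q75 = np.quantile(X_k, [0.25, 0.5, 0.75])

# the 0.5-order quantile is the median
assert q50 == np.median(X_k)
print(q25, q50, q75)
```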
3.12 Kurtosis
Given a quantitative variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
The kurtosis coefficient of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is defined as: \(\\[0.35cm]\)
\[ \Psi(X_k) = \dfrac{\mu_{4}}{\sigma(X_k)^{4}} \]
where:
\[ \mu_{4}\hspace{0.1cm} =\hspace{0.1cm} \frac{1}{n} \cdot \sum_{i=1}^{n} \hspace{0.05cm} \left( \hspace{0.05cm} x_{ik} - \overline{X_k} \hspace{0.05cm} \right)^4 \\[0.3cm] \]
Properties:
If \(\hspace{0.12cm}\Psi(X_k) \hspace{0.05cm} > \hspace{0.05cm} 3\hspace{0.08cm}\) \(\hspace{0.2cm}\Rightarrow\hspace{0.2cm}\) the distribution of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is more peaked, with heavier tails, than the normal distribution (leptokurtic). \(\\[0.5cm]\)
If \(\hspace{0.12cm}\Psi(X_k) \hspace{0.05cm} < \hspace{0.05cm} 3\hspace{0.08cm}\) \(\hspace{0.2cm}\Rightarrow\hspace{0.2cm}\) the distribution of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is flatter, with lighter tails, than the normal distribution (platykurtic). \(\\[0.4cm]\)
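A minimal sketch of the kurtosis coefficient, computed from the fourth central moment \(\mu_4 = \frac{1}{n}\sum_{i=1}^n (x_{ik}-\overline{X_k})^4\) on a simulated normal sample, for which \(\Psi \approx 3\):

```python
import numpy as np

rng = np.random.default_rng(0)
X_k = rng.normal(size=100_000)        # large simulated normal sample

mean = X_k.mean()
mu4 = ((X_k - mean) ** 4).mean()      # fourth central moment
psi = mu4 / X_k.std() ** 4            # kurtosis coefficient; ~3 for normal data
print(round(psi, 2))
```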
3.13 Skewness
Given a quantitative variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
The skewness coefficient of \(\hspace{0.05cm} X_k \hspace{0.05cm}\) is defined as: \(\\[0.25cm]\)
\[ \Gamma(X_k) = \dfrac{\mu_{3}}{\sigma(X_k)^{3}} \]
where:
\[ \mu_{3}\hspace{0.1cm} =\hspace{0.1cm} \frac{1}{n} \cdot \sum_{i=1}^{n} \hspace{0.05cm} \left( \hspace{0.05cm} x_{ik} - \overline{X_k} \hspace{0.05cm} \right)^3 \\[0.3cm] \]
Properties:
Fisher’s skewness coefficient measures the degree of skewness in the distribution of a given statistical variable.
If \(\hspace{0.12cm} \Gamma(X_k) > 0\) \(\hspace{0.2cm} \Rightarrow \hspace{0.2cm}\) the distribution of \(X_k\) is skewed to the right (longer right tail). \(\\[0.6cm]\)
If \(\hspace{0.12cm} \Gamma(X_k) < 0\) \(\hspace{0.2cm} \Rightarrow \hspace{0.2cm}\) the distribution of \(X_k\) is skewed to the left (longer left tail). \(\\[0.4cm]\)
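A minimal sketch of the skewness coefficient, computed from the third central moment \(\mu_3 = \frac{1}{n}\sum_{i=1}^n (x_{ik}-\overline{X_k})^3\) on a simulated right-skewed sample:

```python
import numpy as np

rng = np.random.default_rng(0)
X_k = rng.exponential(size=100_000)   # right-skewed simulated sample

mean = X_k.mean()
mu3 = ((X_k - mean) ** 3).mean()      # third central moment
gamma = mu3 / X_k.std() ** 3          # skewness coefficient; ~2 for an exponential
print(round(gamma, 2))                # positive => skewed to the right
```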
3.14 Outliers
There are several definitions of outlier, but here we are going to consider the classic one.
Given a statistical variable \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\), and a sample \(\hspace{0.05cm}X_k \hspace{0.05cm} = \hspace{0.05cm} \left( \hspace{0.01cm} x_{1k} \hspace{0.01cm} , \hspace{0.01cm}x_{2k}\hspace{0.01cm},\dots ,\hspace{0.01cm} x_{nk} \hspace{0.01cm}\right)^t\hspace{0.05cm}\) of that statistical variable.
For any \(\hspace{0.05cm} i\in \lbrace 1,...,n \rbrace\) ,
The observation \(\hspace{0.05cm} x_{ik}\hspace{0.05cm}\) of \(\hspace{0.05cm} \mathcal{X}_k\hspace{0.05cm}\) is an outlier if and only if:
\[x_{ik} \hspace{0.05cm} >\hspace{0.05cm} Q(X_k \hspace{0.05cm} , \hspace{0.05cm} 0.75) + 1.5\cdot IQR(X_k) \hspace{0.5cm}\text{or}\hspace{0.5cm} x_{ik} \hspace{0.05cm} <\hspace{0.05cm} Q(X_k \hspace{0.05cm} , \hspace{0.05cm} 0.25) - 1.5\cdot IQR(X_k) \\\]
where: \(\hspace{0.25cm} IQR(X_k) \hspace{0.12cm} = \hspace{0.12cm} Q(X_k \hspace{0.05cm} , \hspace{0.05cm} 0.75) \hspace{0.08cm} - \hspace{0.08cm} Q(X_k \hspace{0.05cm} , \hspace{0.05cm} 0.25) \hspace{0.25cm}\) is the interquartile range of \(\hspace{0.05cm} X_k \hspace{0.05cm}\).
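The classic \(1.5 \cdot IQR\) rule above can be sketched with NumPy on a toy sample:

```python
import numpy as np

X_k = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 50.0])

q1, q3 = np.quantile(X_k, [0.25, 0.75])       # quartiles
iqr = q3 - q1                                  # interquartile range
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = X_k[(X_k < lower) | (X_k > upper)]  # classic 1.5*IQR rule
print(outliers)  # [50.]
```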
3.15 Data Matrix
Given \(\hspace{0.05cm} p \hspace{0.05cm}\) statistical variables \(\hspace{0.05cm}\mathcal{X}_1, \mathcal{X}_2, \dots \mathcal{X}_p\hspace{0.05cm}\), and given a sample \(\hspace{0.05cm}X_k = (x_{1k},...,x_{nk})^t\hspace{0.05cm}\) of \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) for each \(\hspace{0.05cm}k \in \lbrace 1,...,p \rbrace\).
A data matrix of the variables \(\hspace{0.05cm}\mathcal{X}_1,...,\mathcal{X}_p\hspace{0.05cm}\) would be: \(\\[0.35cm]\)
\[ X \hspace{0.05cm}=\hspace{0.05cm} \left( X_1 , X_2,\dots , X_p \right) \hspace{0.05cm}=\hspace{0.05cm} \begin{pmatrix} x_{1}^{t} \\ x_{2}^t \\ \vdots \\ x_{n}^t \end{pmatrix} \hspace{0.05cm}=\hspace{0.05cm} \begin{pmatrix} x_{11} & x_{12}& \dots &x_{1p}\\ x_{21} & x_{22}& \dots &x_{2p}\\ \vdots & \vdots & \ddots & \vdots \\ x_{n1}& x_{n2}& \dots &x_{np} \end{pmatrix} \\ \]
where:
\(x_i ^t \hspace{0.05cm}=\hspace{0.05cm} \left( x_{i1}, x_{i2}, \dots , x_{ip} \right)\hspace{0.1cm}\) is the vector with the values of the \(\hspace{0.05cm} p \hspace{0.05cm}\) statistical variables \(\hspace{0.05cm}\mathcal{X}_1,\dots ,\mathcal{X}_p\hspace{0.05cm}\) for the \(\hspace{0.05cm}i\)-th element of the sample, for \(\hspace{0.05cm} i \in \lbrace 1,...,n \rbrace\) \(\\[0.4cm]\)
Observations:
\(X \hspace{0.1cm}\) is a matrix with \(\hspace{0.05cm}n\hspace{0.05cm}\) rows and \(\hspace{0.05cm}p\hspace{0.05cm}\) columns, so it is a matrix of size \(\hspace{0.05cm} n \times p\). \(\\[0.4cm]\)
3.16 Covariance
Given the statistical variables \(\hspace{0.05cm}\mathcal{X}_1, \mathcal{X}_2, \dots \mathcal{X}_p\hspace{0.05cm}\), and given a sample \(\hspace{0.05cm}X_k = (x_{1k},...,x_{nk})^t\hspace{0.05cm}\) of \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) for each \(\hspace{0.05cm}k \in \lbrace 1,...,p \rbrace\).
The covariance between \(\hspace{0.05cm}X_k\hspace{0.05cm}\) and \(\hspace{0.05cm}X_r\hspace{0.05cm}\) is defined as:
\[ S(X_k, X_r) \hspace{0.1cm}=\hspace{0.1cm} \frac{1}{n} \cdot \sum_{i=1}^{n} \left(\hspace{0.05cm} x_{ik} - \overline{X_k} \hspace{0.05cm}\right)\cdot \left(\hspace{0.05cm} x_{ir} - \overline{X_r} \hspace{0.05cm}\right) \] \(\\[0.4cm]\)
3.16.1 Properties of covariance
\(S(X_k,X_r) \in (-\infty, \infty)\) \(\\[0.5cm]\)
\(S(X_k,X_r) \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{n}\cdot \sum_{i=1}^{n} (x_{ik} \cdot x_{ir}) \hspace{0.05cm} - \hspace{0.05cm} \overline{X_k} \cdot \overline{X_r} \hspace{0.1cm} = \hspace{0.1cm} \overline{X_k\cdot X_r} \hspace{0.05cm} - \hspace{0.05cm} \overline{X_k} \cdot \overline{X_r}\) \(\\[0.5cm]\)
\(S(X_k, a + b\cdot X_r) \hspace{0.1cm} = \hspace{0.1cm} b\cdot S(X_k,X_r)\) \(\\[0.5cm]\)
\(S(X_k,X_r) \hspace{0.1cm} = \hspace{0.1cm} S(X_r,X_k)\) \(\\[0.5cm]\)
\(S(X_k,X_r)\hspace{0.05cm} >\hspace{0.05cm} 0 \hspace{0.2cm} \Rightarrow \hspace{0.2cm}\) positive linear relationship between \(\hspace{0.05cm}X_k\hspace{0.05cm}\) and \(\hspace{0.05cm}X_r\hspace{0.05cm}\). \(\\[0.5cm]\)
\(S(X_k,X_r)\hspace{0.05cm} <\hspace{0.05cm} 0 \hspace{0.2cm} \Rightarrow \hspace{0.2cm}\) negative linear relationship between \(\hspace{0.05cm}X_k\hspace{0.05cm}\) and \(\hspace{0.05cm}X_r\hspace{0.05cm}\). \(\\[0.5cm]\)
\(S(X_k,X_r) \hspace{0.05cm}=\hspace{0.05cm} 0 \hspace{0.2cm} \Rightarrow \hspace{0.2cm}\) there is no linear relationship between \(\hspace{0.05cm}X_k\hspace{0.05cm}\) and \(\hspace{0.05cm}X_r\hspace{0.05cm}\) (a non-linear relationship may still exist). \(\\[0.5cm]\)
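A minimal sketch of the covariance with the \(1/n\) divisor used in the definition above; note that `np.cov` uses \(1/(n-1)\) by default, so `bias=True` is needed to match:

```python
import numpy as np

X_k = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_r = np.array([2.0, 4.0, 6.0, 8.0, 10.0])    # X_r = 2 * X_k (positive relationship)

# covariance with the 1/n divisor, as in the definition above
cov = ((X_k - X_k.mean()) * (X_r - X_r.mean())).mean()

# np.cov uses 1/(n-1) by default; bias=True switches it to 1/n
assert np.isclose(cov, np.cov(X_k, X_r, bias=True)[0, 1])
print(cov)  # 4.0
```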
3.17 Covariance Matrix
The covariance matrix of a given data matrix \(\hspace{0.05cm}X \hspace{0.05cm}=\hspace{0.05cm} (X_1,...,X_p)\hspace{0.05cm}\) is: \(\\[0.2cm]\)
\[ S_X = \bigl( \hspace{0.2cm} s_{k,r} \hspace{0.05cm} : \hspace{0.05cm} k,r \in \lbrace 1,...,p \rbrace \hspace{0.2cm} \bigr) \]
where: \(\hspace{0.15cm} s_{k,r} = S(X_k , X_r)\) \(\\[0.25cm]\)
Matrix expression of the covariance matrix :
\[ S_X \hspace{0.1cm} = \hspace{0.1cm} \dfrac{1}{n} \cdot X^t \cdot H \cdot X \]
where: \(\hspace{0.15cm} H \hspace{0.1cm}=\hspace{0.1cm} I_n \hspace{0.05cm} - \hspace{0.05cm} \dfrac{1}{n} \cdot 1_{n\times 1} \cdot 1^t_{n\times 1} \hspace{0.15cm}\) is the centering matrix
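The matrix expression can be checked numerically; a sketch on simulated data (the centering matrix \(H\) subtracts each column mean):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # data matrix: n = 50 rows, p = 3 columns
n = X.shape[0]

H = np.eye(n) - np.ones((n, n)) / n     # centering matrix H = I_n - (1/n) * 1 * 1^t
S_X = X.T @ H @ X / n                   # S_X = (1/n) * X^t * H * X

# agrees with NumPy's covariance matrix computed with the 1/n divisor
assert np.allclose(S_X, np.cov(X, rowvar=False, bias=True))
print(S_X.shape)  # (3, 3)
```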
3.18 Correlation
Given the statistical variables \(\hspace{0.05cm}\mathcal{X}_1, \mathcal{X}_2, \dots \mathcal{X}_p\hspace{0.05cm}\), and given a sample \(\hspace{0.05cm}X_k = (x_{1k},...,x_{nk})^t\hspace{0.05cm}\) of \(\hspace{0.05cm}\mathcal{X}_k\hspace{0.05cm}\) for each \(\hspace{0.05cm}k \in \lbrace 1,...,p \rbrace\).
The Pearson linear correlation between the variables \(X_k\) and \(X_r\) is defined as:
\[ r(X_k,X_r) = \frac{S(X_k,X_r)}{\sigma(X_k) \cdot \sigma(X_r)} \] \(\\[0.25cm]\)
3.18.1 Properties of Pearson linear correlation
\(r(X_k,X_r) \in [-1,1]\) \(\\[0.5cm]\)
\(r(X_k, a + b\cdot X_r) = \text{sign}(b)\cdot r(X_k,X_r)\) , for \(b \neq 0\) \(\\[0.5cm]\)
The sign of \(r(X_k,X_r)\) is equal to the sign of \(S(X_k,X_r)\) \(\\[0.5cm]\)
\(r(X_k,X_r) = \pm 1 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\) perfect linear relationship between \(X_k\) and \(X_r\). \(\\[0.5cm]\)
\(r(X_k,X_r) = 0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\) there is no linear relationship between \(X_k\) and \(X_r\). \(\\[0.5cm]\)
\(r(X_k,X_r) \rightarrow \pm 1 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\) strong linear relationship between \(X_k\) and \(X_r\). \(\\[0.5cm]\)
\(r(X_k,X_r) \rightarrow 0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\) weak linear relationship between \(X_k\) and \(X_r\). \(\\[0.5cm]\)
\(r(X_k,X_r) >0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\) positive relationship between \(X_k\) and \(X_r\). \(\\[0.5cm]\)
\(r(X_k,X_r) <0 \hspace{0.1cm} \Rightarrow \hspace{0.1cm}\) negative relationship between \(X_k\) and \(X_r\). \(\\[0.5cm]\)
3.19 Pearson Correlation Matrix
The Pearson correlation matrix of the data matrix \(X=(X_1 ,..., X_p)\) is : \(\\[0.25cm]\)
\[ R_X =\bigl( \hspace{0.12cm} r_{k,r} \hspace{0.12cm} : \hspace{0.12cm} k,r\in \lbrace 1,...,p \rbrace \hspace{0.12cm} \bigr) \] \(\\[0.25cm]\)
where: \(\hspace{0.2cm} r_{k,r} = r(X_k , X_r) \hspace{0.1cm}\) , for \(\hspace{0.12cm} k,r=1,...,p\) \(\\[0.35cm]\)
Matrix expression of the correlation matrix
\[ R_X= D_s^{-1} \cdot S_X \cdot D_s^{-1} \]
where: \[ D_s \hspace{0.05cm} = \hspace{0.05cm} \text{diag} \left( \hspace{0.05cm} \sigma(X_1) ,..., \sigma(X_p) \hspace{0.05cm} \right) \] \(\\[0.5cm]\)
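A sketch of the expression \(R_X = D_s^{-1} \cdot S_X \cdot D_s^{-1}\) on simulated data, checked against `np.corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # data matrix with p = 3 variables

S_X = np.cov(X, rowvar=False, bias=True)     # covariance matrix, 1/n divisor
D_s_inv = np.diag(1.0 / X.std(axis=0))       # D_s^{-1} = diag(1/sigma(X_1), ...)
R_X = D_s_inv @ S_X @ D_s_inv                # R_X = D_s^{-1} * S_X * D_s^{-1}

assert np.allclose(R_X, np.corrcoef(X, rowvar=False))
assert np.allclose(np.diag(R_X), 1.0)        # each variable correlates 1 with itself
```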
3.20 Absolute Frequency
Given a statistical variable \(\hspace{0.07cm}\mathcal{X}_k\hspace{0.05cm}\) and a sample \(\hspace{0.05cm}X_k = (x_{1k},...,x_{nk})^t\hspace{0.07cm}\) of \(\hspace{0.07cm}\mathcal{X}_k\hspace{0.03cm}\).
3.20.1 Absolute Frequency of an element
Given \(\hspace{0.07cm} b \in Range(\mathcal{X}_k)\).
The absolute frequency of the element \(\hspace{0.07cm}b\hspace{0.07cm}\) in \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is defined as:
\[ F_A(b ,X_k) \hspace{0.1cm}=\hspace{0.1cm} \# \hspace{0.05cm} \Bigl\{ \hspace{0.1cm} i \in \lbrace 1,... , n \rbrace \hspace{0.1cm} : \hspace{0.1cm} x_{ik}=b \hspace{0.1cm} \Bigr\} \]
Observation:
If \(\hspace{0.05cm}\) \(\mathcal{X}_k\) \(\hspace{0.05cm}\) is continuous, usually \(\hspace{0.05cm}\) \(F_A(b , X_k) = 0\) \(\hspace{0.05cm}\) for most values of \(\hspace{0.05cm}\) \(b\), since exact repetitions are unlikely in a continuous sample. \(\\[0.4cm]\)
3.20.2 Absolute frequency of a set
Given \(\hspace{0.05cm}B \subset Range(\mathcal{X}_k)\)
The absolute frequency of the set \(\hspace{0.05cm}B\hspace{0.05cm}\) in \(\hspace{0.05cm}X_k\hspace{0.05cm}\) is defined as:
\[ F_A(B, X_k) \hspace{0.1cm}=\hspace{0.1cm} \sum_{b \in B} F_A(b , X_k ) \]
Observation:
\(F_A([c_1,c_2], X_k)\) \(\hspace{0.08cm}\) is a particular case of \(\hspace{0.08cm}\) \(F_A(B, X_k)\) \(\hspace{0.08cm}\) with \(\hspace{0.08cm}\) \(B=[c_1,c_2]\) \(\\[0.4cm]\)
3.21 Relative Frequency
Given a statistical variable \(\hspace{0.07cm}\mathcal{X}_k\hspace{0.05cm}\) and a sample \(\hspace{0.05cm}X_k = (x_{1k},...,x_{nk})^t\hspace{0.07cm}\) of \(\hspace{0.07cm}\mathcal{X}_k\hspace{0.03cm}\).
3.21.1 Relative frequency of an element
Given \(\hspace{0.07cm}b \in Range(\mathcal{X}_k)\)
The relative frequency of the element \(\hspace{0.07cm}b\hspace{0.07cm}\) in \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is defined as :
\[ F_{Re}(b,X_k) \hspace{0.07cm}=\hspace{0.07cm} \dfrac{F_A(b,X_k) }{n} \] \(\\[0.4cm]\)
3.21.2 Relative frequency of a set
Given \(\hspace{0.07cm}B \subset Range(\mathcal{X}_k)\).
The relative frequency of the set \(\hspace{0.07cm}B\hspace{0.07cm}\) in \(\hspace{0.07cm}X_k\hspace{0.07cm}\) is defined as:
\[ F_{Re}(B,X_k) \hspace{0.07cm}=\hspace{0.07cm} \dfrac{F_A(B ,X_k) }{n} \] \(\\[0.4cm]\)
3.22 Cumulative Absolute Frequency
The cumulative absolute frequency of the element \(b\) in \(X_k\) is defined as:
\[ F_{CumA}(b ,X_k) \hspace{0.07cm}= \hspace{0.07cm} \# \hspace{0.05cm} \Bigl\{ \hspace{0.1cm} i \in \lbrace 1,...,n \rbrace \hspace{0.1cm} : \hspace{0.1cm} x_{ik} \leq b \hspace{0.1cm} \Bigr\} \] \(\\[0.4cm]\)
3.23 Cumulative Relative Frequency
The cumulative relative frequency of the element \(b\) in \(X_k\) is defined as:
\[ F_{CumRe}(b,X_k)= \dfrac{F_{CumA}(b,X_k)}{n} \] \(\\[0.4cm]\)
3.24 Frequency Table
A frequency table is a table that gathers the absolute, relative and cumulative frequencies of a statistical variable.
3.24.1 Frequency Table in Python
\(\\[0.4cm]\)
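A minimal sketch of a frequency table with pandas `value_counts` (the toy series mimics the `type` column of the Netflix data):

```python
import pandas as pd

# toy categorical sample, mimicking the 'type' column of the Netflix data
X_k = pd.Series(['MOVIE', 'SHOW', 'MOVIE', 'MOVIE', 'SHOW'])

abs_freq = X_k.value_counts()                 # absolute frequencies F_A
rel_freq = X_k.value_counts(normalize=True)   # relative frequencies F_Re

freq_table = pd.DataFrame({
    'F_A': abs_freq,
    'F_Re': rel_freq,
    'F_CumA': abs_freq.cumsum(),              # cumulative absolute frequency
    'F_CumRe': rel_freq.cumsum(),             # cumulative relative frequency
})
print(freq_table)
```

Note that `value_counts` orders categories by descending frequency; for a quantitative variable the rows should be sorted by value (`sort_index()`) before accumulating, so the cumulative columns match the definitions of \(F_{CumA}\) and \(F_{CumRe}\).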
4 Statistical Description Protocol for Quantitative Variables
mean, median, variance, quantiles, kurtosis, skewness, outliers
frequency tables –> https://www.statology.org/frequency-tables-python/
5 Statistical Description Protocol for Categorical Variables
mode, quantiles
frequency tables
6 Statistical Description Protocol for Variable Crossings (quantitative-categorical, categorical-categorical, quantitative-quantitative)
quantitative-categorical –> mean, median, variance, quantiles, etc. BY GROUPS. Joint and conditional frequency tables.
categorical-categorical –> Joint and conditional frequency tables.
quantitative-quantitative –> transform to categorical-categorical case.
7 Statistical visualization
7.1 Visualization Protocol for Quantitative Variables
7.2 Visualization Protocol for Categorical Variables
7.3 Visualization Protocol for Quantitative-Categorical
7.4 Visualization Protocol for Categorical-Categorical
8 Basic Statistical Description
Next, we carry out a basic statistical description of the variables by means of various basic statistics.
8.1 Basic statistics for the quantitative variables
For the quantitative variables:
Netflix_Data.describe()

| | release_year | runtime | seasons | imdb_score | imdb_votes | tmdb_popularity | tmdb_score |
|---|---|---|---|---|---|---|---|
| count | 5850.000000 | 5850.000000 | 2106.000000 | 5368.000000 | 5.352000e+03 | 5759.000000 | 5539.000000 |
| mean | 2016.417094 | 76.888889 | 2.162868 | 6.510861 | 2.343938e+04 | 22.637925 | 6.829175 |
| std | 6.937726 | 39.002509 | 2.689041 | 1.163826 | 9.582047e+04 | 81.680263 | 1.170391 |
| min | 1945.000000 | 0.000000 | 1.000000 | 1.500000 | 5.000000e+00 | 0.009442 | 0.500000 |
| 25% | 2016.000000 | 44.000000 | 1.000000 | 5.800000 | 5.167500e+02 | 2.728500 | 6.100000 |
| 50% | 2018.000000 | 83.000000 | 1.000000 | 6.600000 | 2.233500e+03 | 6.821000 | 6.900000 |
| 75% | 2020.000000 | 104.000000 | 2.000000 | 7.300000 | 9.494000e+03 | 16.590000 | 7.537500 |
| max | 2022.000000 | 240.000000 | 42.000000 | 9.600000 | 2.294231e+06 | 2274.044000 | 10.000000 |
8.2 Basic statistics for the categorical variables
For the categorical variables (non-quantitative, in general):
Netflix_Data.loc[: , ['title', 'description', 'age_certification', 'genres', 'production_countries' ]].describe()

| | title | description | age_certification | genres | production_countries |
|---|---|---|---|---|---|
| count | 5849 | 5832 | 3231 | 5850 | 5850 |
| unique | 5798 | 5829 | 11 | 1726 | 452 |
| top | The Gift | Five families struggle with the ups and downs … | TV-MA | [‘comedy’] | [‘US’] |
| freq | 3 | 2 | 883 | 484 | 1959 |
8.3 Joint plots for the quantitative variables
In this section we carry out a basic graphical analysis of the quantitative variables, considered jointly.
We load the libraries needed for the plots:
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

8.3.1 Joint histogram of the quantitative variables
We generate a figure with a histogram for each of the quantitative variables:
fig, axs = plt.subplots(3, 3, figsize=(11, 11))
p1 = sns.histplot(data=Netflix_Data, x="release_year", stat="proportion", bins=15, color="skyblue", ax=axs[0, 0])
p2 = sns.histplot(data=Netflix_Data, x="runtime", stat="proportion", bins=15, color="olive", ax=axs[0, 1])
p2.axes.set(xlabel='runtime', ylabel=' ')
p3 = sns.histplot(data=Netflix_Data, x="seasons", stat="proportion", bins=15, color="blue", ax=axs[0, 2])
p3.axes.set(xlabel='seasons', ylabel=' ')
p4 = sns.histplot(data=Netflix_Data, x="imdb_score", stat="proportion", bins=15, color="teal", ax=axs[1, 0])
p4.axes.set(xlabel='imdb_score', ylabel=' ')
p5 = sns.histplot(data=Netflix_Data, x="imdb_votes", stat="proportion", bins=15, color="purple", ax=axs[1, 1])
p5.axes.set(xlabel='imdb_votes', ylabel=' ')
p6 = sns.histplot(data=Netflix_Data, x="tmdb_popularity", stat="proportion", bins=15, color="pink", ax=axs[1, 2])
p6.axes.set(xlabel='tmdb_popularity', ylabel=' ')
p7 = sns.histplot(data=Netflix_Data, x="tmdb_score", stat="proportion", bins=15, color="red", ax=axs[2, 0])
p7.axes.set(xlabel='tmdb_score', ylabel=' ')
fig.savefig('p1.png', format='png', dpi=1200)
plt.show()

Joint histogram of the quantitative variables
8.3.2 Joint box-plots of the quantitative variables
We generate a figure with a box-plot for each of the quantitative variables:
fig, axs = plt.subplots(3, 3, figsize=(11, 11))
p1 = sns.boxplot(data=Netflix_Data, x="release_year", color="skyblue", ax=axs[0, 0])
p2 = sns.boxplot(data=Netflix_Data, x="runtime", color="olive", ax=axs[0, 1])
p2.axes.set(xlabel='runtime', ylabel=' ')
p2.set_xticks( range(int(Netflix_Data['runtime'].min()) , int(Netflix_Data['runtime'].max()) , 100) )
p2.set_yticks( np.arange(0, 1, 0.1) )
p3 = sns.boxplot(data=Netflix_Data, x="seasons", color="blue", ax=axs[0, 2])
p3.axes.set(xlabel='seasons', ylabel=' ')
p4 = sns.boxplot(data=Netflix_Data, x="imdb_score", color="teal", ax=axs[1, 0])
p4.axes.set(xlabel='imdb_score', ylabel=' ')
p4.set_xticks( range(int(Netflix_Data['imdb_score'].min()) , int(Netflix_Data['imdb_score'].max())+1 , 1) )
p4.set_yticks( np.arange(0, 1, 0.1) )
p5 = sns.boxplot(data=Netflix_Data, x="imdb_votes", color="purple", ax=axs[1, 1])
p5.axes.set(xlabel='imdb_votes', ylabel=' ')
p5.set_xticks( range(int(Netflix_Data['imdb_votes'].min()) , int(Netflix_Data['imdb_votes'].max()/2) , 500000) )
p5.set_yticks( np.arange(0, 1, 0.1) )
p6 = sns.boxplot(data=Netflix_Data, x="tmdb_popularity", color="pink", ax=axs[1, 2])
p6.axes.set(xlabel='tmdb_popularity', ylabel=' ')
p6.set_xticks( range(int(Netflix_Data['tmdb_popularity'].min()) , int(Netflix_Data['tmdb_popularity'].max()+1) , 1000) )
p6.set_yticks( np.arange(0, 1, 0.1) )
p7 = sns.boxplot(data=Netflix_Data, x="tmdb_score", color="red", ax=axs[2, 0])
p7.axes.set(xlabel='tmdb_score', ylabel=' ')
p7.set_xticks( range(int(Netflix_Data['tmdb_score'].min()) , int(Netflix_Data['tmdb_score'].max()+1) , 2) )
p7.set_yticks( np.arange(0, 1, 0.1) )
plt.show()

Joint box-plots of the quantitative variables
8.3.3 Joint ECDF plots (Empirical Cumulative Distribution Function) of the quantitative variables
We generate a figure with an ECDF plot for each of the quantitative variables:
fig, axs = plt.subplots(3, 3, figsize=(11, 11))
p1 = sns.ecdfplot(data=Netflix_Data, x="release_year", color="skyblue", ax=axs[0, 0])
p1.set_xticks( range(int(Netflix_Data['release_year'].min()) , int(Netflix_Data['release_year'].max()+20) , 20) )
p1.set_yticks( np.arange(0, 1, 0.1) )
p2 = sns.ecdfplot(data=Netflix_Data, x="runtime", color="olive", ax=axs[0, 1])
p2.axes.set(xlabel='runtime', ylabel=' ')
p2.set_xticks( range(int(Netflix_Data['runtime'].min()) , int(Netflix_Data['runtime'].max()) , 100) )
p2.set_yticks( np.arange(0, 1, 0.1) )
p3 = sns.ecdfplot(data=Netflix_Data, x="seasons", color="blue", ax=axs[0, 2])
p3.axes.set(xlabel='seasons', ylabel=' ')
p3.set_xticks( range(int(Netflix_Data['seasons'].min()) , int(Netflix_Data['seasons'].max()) , 4) )
p3.set_yticks( np.arange(0, 1, 0.1) )
p4 = sns.ecdfplot(data=Netflix_Data, x="imdb_score", color="teal", ax=axs[1, 0])
p4.axes.set(xlabel='imdb_score', ylabel=' ')
p4.set_xticks( range(int(Netflix_Data['imdb_score'].min()) , int(Netflix_Data['imdb_score'].max())+1 , 1) )
p4.set_yticks( np.arange(0, 1, 0.1) )
p5 = sns.ecdfplot(data=Netflix_Data, x="imdb_votes", color="purple", ax=axs[1, 1])
p5.axes.set(xlabel='imdb_votes', ylabel=' ')
p5.set_xticks( range(int(Netflix_Data['imdb_votes'].min()) , int(Netflix_Data['imdb_votes'].max()/2) , 500000) )
p5.set_yticks( np.arange(0, 1, 0.1) )
p6 = sns.ecdfplot(data=Netflix_Data, x="tmdb_popularity", color="pink", ax=axs[1, 2])
p6.axes.set(xlabel='tmdb_popularity', ylabel=' ')
p6.set_xticks( range(int(Netflix_Data['tmdb_popularity'].min()) , int(Netflix_Data['tmdb_popularity'].max()+1) , 1000) )
p6.set_yticks( np.arange(0, 1, 0.1) )
p7 = sns.ecdfplot(data=Netflix_Data, x="tmdb_score", color="red", ax=axs[2, 0])
p7.axes.set(xlabel='tmdb_score', ylabel=' ')
p7.set_xticks( range(int(Netflix_Data['tmdb_score'].min()) , int(Netflix_Data['tmdb_score'].max()+1) , 2) )
p7.set_yticks( np.arange(0, 1, 0.1) )
plt.show()

Joint ECDF plots of the quantitative variables
8.4 Joint plots for the categorical variables
8.4.1 Joint bar-plots of the categorical variables
We generate a figure with a bar-plot for each of the categorical variables, except those whose number of categories is too large for the plot to be practical:
fig, axs = plt.subplots(1, 2, figsize=(13, 6))
p1 = sns.countplot(x='type', data=Netflix_Data, ax=axs[0])
p1.set_xticklabels(['Movie', 'Show'])
p1.axes.set(xlabel='type', ylabel='count')
p2 = sns.countplot(x='age_certification', data=Netflix_Data, ax=axs[1])
plt.show()

Joint bar-plots of the categorical variables
9 Statistical Analysis
In the previous section we gave a basic statistical description of the variables in the data-set we are working with, but without analysing the results obtained.
In this section, besides extending the statistical description of the data, we will also analyse the results obtained.